
datalad-next's Introduction

DataLad NEXT extension


This DataLad extension can be thought of as a staging area for additional functionality, or for improved performance and user experience. Unlike other topical or more experimental extensions, the focus here is on functionality with broad applicability. This extension is a suitable dependency for other software packages that intend to build on this improved set of functionality.

Installation

# create and enter a new virtual environment (optional)
$ virtualenv --python=python3 ~/env/dl-next
$ . ~/env/dl-next/bin/activate
# install from PyPI
$ python -m pip install datalad-next

How to use

Additional commands provided by this extension are immediately available after installation. However, in order to fully benefit from all improvements, the extension has to be enabled for auto-loading by executing:

git config --global --add datalad.extensions.load next

Doing so will enable the extension to also alter the behavior of the core DataLad package and its commands.
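The setting can be verified with a standard Git call (the output should include next):

$ git config --global --get-all datalad.extensions.load
next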

Summary of functionality provided by this extension

  • A replacement sub-system for credential handling that is able to handle arbitrary properties for annotating a secret, and facilitates determining suitable credentials while minimizing avoidable user interaction, without compromising configurability. A convenience method is provided that implements a standard workflow for obtaining a credential.
  • A user-facing credentials command to set, remove, and query credentials.
  • The create-sibling-... commands for the platforms GitHub, GIN, GOGS, Gitea are equipped with improved credential handling that, for example, only stores entered credentials after they were confirmed to work, or auto-selects the most recently used, matching credentials, when none are specified.
  • A create-sibling-webdav command for hosting datasets on a WebDAV server via a sibling tandem for Git history and file storage. Datasets hosted on WebDAV in this fashion are cloneable with datalad-clone. A full annex setup for storing complete datasets with historical file content version, and an additional mode for depositing single-version dataset snapshot are supported. The latter enables convenient collaboration with audiences that are not using DataLad, because all files are browsable via a WebDAV server's point-and-click user interface.
  • Enhance datalad-push to automatically export files to git-annex special remotes configured with exporttree=yes.
  • Speed-up datalad-push when processing non-git special remotes. This particularly benefits less efficient hosting scenarios like WebDAV.
  • Enhance datalad-siblings enable (AnnexRepo.enable_remote()) to automatically deploy credentials for git-annex special remotes that require them.
  • git-remote-datalad-annex is a Git remote helper to push/fetch to any location accessible by any git-annex special remote.
  • git-annex-backend-XDLRA (originally available from the mihextras extension) is a custom external git-annex backend used by git-remote-datalad-annex. A base class to facilitate development of external backends in Python is also provided.
  • Enhance datalad-configuration to support getting configuration from "global" scope without a dataset being present.
  • New modular framework for URL operations. This framework directly supports operation on http(s), ssh, and file URLs, and can be extended with custom functionality for additional protocols or even interaction with specific individual servers. The basic operations download, upload, delete, and stat are recognized, and can be implemented. The framework offers uniform progress reporting and simultaneous content has computation. This framework is meant to replace and extend the downloader/provide framework in the DataLad core package. In contrast to its predecessor it is integrated with the new credential framework, and operations beyond downloading.
  • git-annex-remote-uncurl is a special remote that exposes the new URL operations framework via git-annex. It provides flexible means to compose and rewrite URLs (e.g., to compensate for storage infrastructure changes) without having to modify individual URLs recorded in datasets. It enables seamless transitions between any services and protocols supported by the framework. This special remote can replace the datalad special remote provided by the DataLad core package.
  • A download command is provided as a front-end for the new modular URL operations framework.
  • A python-requests compatible authentication handler (DataladAuth) that interfaces DataLad's credential system.
  • Boosted throughput of DataLad's runner component for command execution.
  • Substantially more comprehensive replacement for DataLad's constraints system for type conversion and parameter validation.
  • Windows and Mac client support for RIA store access.
  • A next-status command that is A LOT faster than status, and offers a mono recursion mode that shows modifications of nested dataset hierarchies relative to the state of the root dataset. Requires Git v2.31 (or later).

Summary of additional features for DataLad extension development

  • Framework for uniform command parameter validation. Regardless of the used API (Python, CLI, or GUI), command parameters are uniformly validated. This facilitates a stricter separation of parameter specification (and validation) from the actual implementation of a command. The latter can now focus on a command's logic only, while the former enables more uniform and more comprehensive validation and error reporting. Beyond per-parameter validation and type-conversion also inter-parameter dependency validation and value transformations are supported.
  • Improved composition of importable functionality. Key components for commands, annexremotes, datasets (etc) are collected in topical top-level modules that provide "all" necessary pieces in a single place.
  • webdav_server fixture that automatically deploys a local WebDAV server.
  • Utilities for HTTP handling
    • probe_url() discovers redirects and authentication requirements for an HTTP URL
    • get_auth_realm() returns a label for an authentication realm that can be used to query for matching credentials
  • Utilities for special remote credential management:
    • get_specialremote_credential_properties() inspects a special remote and returns properties for querying a credential store for matching credentials
    • update_specialremote_credential() updates a credential in a store after successful use
    • get_specialremote_credential_envpatch() returns a suitable environment "patch" from a credential for a particular special remote type
  • Helper for runtime-patching other datalad code (datalad_next.utils.patch)
  • Base class for implementing custom git-annex backends.
  • A set of pytest fixtures to:
    • check that no global configuration side-effects are left behind by a test
    • check that no secrets are left behind by a test
    • provide a temporary configuration that is isolated from a user environment and from other tests
    • provide a temporary secret store that is isolated from a user environment and from other tests
    • provide a temporary credential manager to perform credential deployment and manipulation isolated from a user environment and from other tests
  • An iter_subproc() helper that enable communication with subprocesses via input/output iterables.
  • A shell context manager that enables interaction with (remote) shells, including support for input/output iterables for each shell-command execution within the context.

Patching the DataLad core package

Some of the features described above rely on a modification of the DataLad core package itself, rather than coming in the form of additional commands. Loading this extension causes a range of patches to be applied to the datalad package to enable them. A comprehensive description of the current set of patches is available at http://docs.datalad.org/projects/next/en/latest/#datalad-patches

Developing with DataLad NEXT

This extension package moves fast in comparison to the core package. Nevertheless, attention is paid to API stability, adequate semantic versioning, and informative changelogs.

Public vs internal API

Anything that can be imported directly from any of the sub-packages in datalad_next is considered to be part of the public API. Changes to this API determine the versioning, and development is done with the aim to keep this API as stable as possible. This includes signatures and return value behavior.

As an example: from datalad_next.runners import iter_git_subproc imports a part of the public API, but from datalad_next.runners.git import iter_git_subproc does not.

Use of the internal API

Developers can obviously use parts of the non-public API. However, this should only be done with the understanding that these components may change from one release to another, with no guarantee of transition periods, deprecation warnings, etc.

Developers are advised to never reuse any components with names starting with _ (underscore). Their use should be limited to their individual subpackage.

Acknowledgements

This DataLad extension was developed with funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant SFB 1451 (431549029, INF project).

Contributors

Michael Hanke
πŸ› πŸ’» πŸ–‹ 🎨 πŸ“– πŸ’΅ πŸ” πŸ€” πŸš‡ 🚧 πŸ§‘β€πŸ« πŸ“¦ πŸ“† πŸ‘€ πŸ“’ ⚠️ πŸ”§ πŸ““

catetrai
πŸ’» 🎨 πŸ“– πŸ€” ⚠️

Chris Markiewicz
🚧 πŸ’»

MichaΕ‚ Szczepanik
πŸ› πŸ’» πŸ–‹ πŸ“– πŸ’‘ πŸ€” πŸš‡ 🚧 πŸ‘€ πŸ“’ ⚠️ βœ… πŸ““

Stephan Heunis
πŸ› πŸ’» πŸ“– πŸ€” 🚧 πŸ“’ πŸ““

Benjamin Poldrack
πŸ› πŸ’»

Yaroslav Halchenko
πŸ› πŸ’» πŸš‡ 🚧 πŸ”§

Christian MΓΆnch
πŸ’» 🎨 πŸ“– πŸ€” πŸ‘€ ⚠️ πŸ““

Adina Wagner
♿️ πŸ› πŸ’» πŸ“– πŸ’‘ 🚧 πŸ“† πŸ‘€ πŸ“’ ⚠️ βœ… πŸ““

John T. Wodder II
πŸ’» πŸš‡ ⚠️


datalad-next's Issues

`push` to webdav sibling reported OK, but no actual repo on the remote as a result

# first the normal thunderstorm of warnings, but OK result
% datalad create-sibling-webdav 'https://fz-juelich.sciebo.de/remote.php/dav/files/m.hanke%40fz-juelich.de/stest2'
create_sibling_webdav.storage(ok): /tmp/stest2 (dataset)
[INFO   ] Configure additional publication dependency on "fz-juelich.sciebo.de-storage" 
[INFO   ] Could not enable annex remote fz-juelich.sciebo.de. This is expected if fz-juelich.sciebo.de is a pure Git remote, or happens if it is not accessible. 
[WARNING] Could not detect whether fz-juelich.sciebo.de carries an annex. If fz-juelich.sciebo.de is a pure Git remote, this is expected. Remote was marked by annex as annex-ignore. Edit .git/config to reset if you think that was done by mistake due to absent connection etc 
add-sibling(ok): . (sibling)
action summary:
  add-sibling (ok: 1)
  create_sibling_webdav.storage (ok: 1)

# push is also OK
% datalad push --to fz-juelich.sciebo.de
publish(ok): . (dataset) [refs/heads/master->fz-juelich.sciebo.de:refs/heads/master [new branch]]                                                    
publish(ok): . (dataset) [refs/heads/git-annex->fz-juelich.sciebo.de:refs/heads/git-annex [new branch]]                                              

Navigating to the webdav UI shows no content. Cloning from the remote URL results in an empty repo.

Something inside knows that things are broken:

[screenshot: 'upload-failure' refs visible in the repository]

I got it to work with a datalad push -f gitpush, but I am not sure this is still causal after this many attempts.

I fail to replicate it now, but it happened for 2-3 attempts. The bad thing is that it gave OK results, and only the 'upload-failure' refs gave an indication that there was a problem. This is not good enough.

File tree not reexported after reconfiguring webdav sibling with a new URL

This is not quite a bug, more an unexpected behavior in a very specific situation, so I'm describing it here for future reference.

In short: if datalad create-sibling-webdav --mode filetree --existing reconfigure is used to change sibling's url after the data has been pushed, git-annex will not re-export the file tree to the new location.

Discovered with @VeronikaWu.

Steps to reproduce:

Create a dataset with text2git configuration and a text file (details probably won't matter, but we're looking at a non-annexed file here):

datalad create -c text2git my-dataset
cd my-dataset
echo "HELLO" > README.md
datalad save -m "Add readme"

Create a webdav sibling, with exporttree enabled (filetree mode), and push to that sibling:

datalad create-sibling-webdav --name mysibling --mode filetree <URL>/my-dataset
datalad push --to mysibling

Reconfigure that sibling with a url pointing to a different location (on the same server), and repeat push:

datalad create-sibling-webdav --name mysibling --mode filetree --existing reconfigure <URL>/our-dataset
datalad push --to mysibling

Observed behaviour

  • The initial push creates a file tree in the webdav remote (i.e. our README file is visible in the webdav storage)
  • The second push (to new url) does not create the file tree: the README file is not visible in the webdav storage

Expected behaviour

  • The file tree is created in both cases

Probable explanation

The reconfiguration does not change the annex UUID of the remote, only its url. It remains essentially the same thing for git-annex. This can be seen by looking at the history of remote.log on the git-annex branch, e.g. with tig:

git switch git-annex
tig remote.log

Therefore, files not changed since the first push will likely appear to git-annex as already exported to the storage remote.

The docs for git-annex-export have these two paragraphs:

Any files in the treeish that are stored on git will also be exported to the special remote.
Repeated exports are done efficiently, by diffing the old and new tree, and transferring only the changed files, and renaming files as necessary.

It is not clear to me whether that diffing is purely local or not, but it might be. And it might actually be a good thing to do it locally, especially if there are many files involved.

How to mitigate

  • Various options of --force and --data arguments given to datalad push did not work for me.
  • Modifying the file and datalad saveing it before pushing leads to re-creating the file tree in the new location (in line with the explanation above), but it is not convenient
  • Removing the sibling and creating a new one from scratch would be a recommended way to export into a new url (if first successful push already happened)

Comment

Taking the above into consideration, we could argue that reconfigure is meant for situations when the place stays essentially the same, and switching to a new location (albeit on the same server) is not the right use case for reconfigure.

I would recommend closing as a wontfix.

Push to filetree WebDAV remote fails

Using Python API on Juseless, with xxx being a filetree mode WebDAV remote...

>>> from datalad import api as dl
>>> dl.push(dataset='.', to='xxx')
Push to 'xxx':  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ                                | 2.00/4.00 [00:00<00:00, 3.66k Steps/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>                                                                                                         
  File "/data/group/psyinf/sfb1451/motor-assessment/venv/lib/python3.7/site-packages/datalad/interface/utils.py", line 447, in eval_func
    return return_func(*args, **kwargs)
  File "/data/group/psyinf/sfb1451/motor-assessment/venv/lib/python3.7/site-packages/datalad/interface/utils.py", line 439, in return_func
    results = list(results)
  File "/data/group/psyinf/sfb1451/motor-assessment/venv/lib/python3.7/site-packages/datalad/interface/utils.py", line 369, in generator_func
    allkwargs):
  File "/data/group/psyinf/sfb1451/motor-assessment/venv/lib/python3.7/site-packages/datalad/interface/utils.py", line 544, in _process_results
    for res in results:
  File "/data/group/psyinf/sfb1451/motor-assessment/venv/lib/python3.7/site-packages/datalad/core/distributed/push.py", line 263, in __call__
    got_path_arg=True if path else False)
  File "/data/group/psyinf/sfb1451/motor-assessment/venv/lib/python3.7/site-packages/datalad/core/distributed/push.py", line 598, in _push
    got_path_arg=got_path_arg,
  File "/data/group/psyinf/sfb1451/motor-assessment/venv/lib/python3.7/site-packages/datalad/core/distributed/push.py", line 621, in _push
    got_path_arg=got_path_arg,
  File "/data/group/psyinf/sfb1451/motor-assessment/venv/lib/python3.7/site-packages/datalad_next/patches/push_to_export_remote.py", line 226, in _transfer_data
    "path": str(Path(res_kwargs["path"]) / result["file"])
  File "/usr/lib/python3.7/pathlib.py", line 901, in __truediv__
    return self._make_child((key,))
  File "/usr/lib/python3.7/pathlib.py", line 688, in _make_child
    drv, root, parts = self._parse_args(args)
  File "/usr/lib/python3.7/pathlib.py", line 642, in _parse_args
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

The relevant code fragment where the error occurs (line 226) is:

with patch.dict('os.environ', env_patch):
    try:
        for result in repo._call_annex_records_items_(
            [
                "export", "HEAD",
                "--to", target
            ],
            progress=True
        ):
            yield {
                **res_kwargs,
                "action": "copy",
                "status": "ok",
                "path": str(Path(res_kwargs["path"]) / result["file"])
            }
    except CommandError as cmd_error:
        ce = CapturedException(cmd_error)
        yield {
            **res_kwargs,
            "action": "copy",
            "status": "error",
            "message": str(ce),
            "exception": ce
        }

I tried the same from my laptop in a Python 3.7 virtualenv and it worked, so it shouldn't be a Python version issue.

Could it be the git-annex version? Juseless has git-annex version 8.20210310.

clone patching is compatible with maint 0.17.x series but not with master of datalad core

While I was working on an unrelated fix in core in datalad/datalad#7075 (which should have gone against maint, not master), I saw next failing in CI.
ha -- I had never seen next fail before, and its error makes little sense to me in relation to these changes:

    ???
/opt/hostedtoolcache/Python/3.7.14/x64/lib/python3.7/site-packages/datalad_next/__init__.py:44: in <module>
    import datalad_next.patches
/opt/hostedtoolcache/Python/3.7.14/x64/lib/python3.7/site-packages/datalad_next/patches/__init__.py:1: in <module>
    from . import (
/opt/hostedtoolcache/Python/3.7.14/x64/lib/python3.7/site-packages/datalad_next/patches/clone.py:30: in <module>
    from .clone_utils import (
/opt/hostedtoolcache/Python/3.7.14/x64/lib/python3.7/site-packages/datalad_next/patches/clone_utils.py:15: in <module>
    from datalad.core.distributed.clone import (
E   ImportError: cannot import name '_get_tracking_source' from 'datalad.core.distributed.clone' (/home/runner/work/datalad/datalad/datalad/core/distributed/clone.py)

It seems to have started failing after a recent merge of maint into master: https://github.com/datalad/datalad/actions/runs/3205459877/jobs/5238006971 -- although it succeeded on maint in https://github.com/datalad/datalad/actions/runs/3191002029, it failed on master before that in https://github.com/datalad/datalad/actions/runs/3143545627/jobs/5121723826 and was absent in https://github.com/datalad/datalad/actions/runs/3143493738 ... so I guess it was always failing on master! Indeed, because the patched functions got moved into clone_utils. I didn't do more archaeology, but it seems the subject line is correct.

TODO list WebDAV support

This issue should contain an overview of all issues that should be addressed/fixed before the end of April 2022, in order to seamlessly integrate clone and push to WebDAV siblings, which might also require a create-sibling-webdav.

Outline of the current implementation strategy:

  1. For each webdav-sibling with name n, create two remotes: n-tree and n-git (see the sketch after this list). The first is a git-annex special remote of type webdav with exporttree set to yes. The second is a datalad-annex::-special remote that points to the same location as n-tree.
  2. Patch (via datalad-next) the push command to redirect pushes to a special remote with export-tree to git annex export, as described here: datalad/datalad#3127. The export implementation could for example be done via git annex export or via something similar to x-export-to-webdav.
  3. Define a publish dependency between both of them. Then pushing, for example, to n-git would automatically trigger a push to n-tree.
  4. Integrate them with the new credential system.
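For illustration, the resulting remote pair for a sibling name n could look roughly like this (a sketch, not a literal transcript; <URL> is a placeholder, and the webdav special remote additionally requires credentials in the environment):

# storage: a git-annex special remote of type webdav with exporttree=yes
git annex initremote n-tree type=webdav url=<URL> exporttree=yes encryption=none
# Git history: a datalad-annex:: remote pointing at the same location
git remote add n-git 'datalad-annex::?type=webdav&url=<URL>&encryption=none&exporttree=yes'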

Work packages:

  • datalad contains a refactoring of push that isolates the actual data transport into _transfer_data (PR datalad/datalad#6618).

  • datalad_next contains a patch in push_to_export_remote.py that patches _transfer_data. There are currently two alternative patches in two branches and two PRs:
    -- Alternative 1: PR #4: use x-export-to-webdav to implement the equivalent of _push_data for exporttree-remotes. This is most likely TOO specific and probably contains TOO much logic (subdatasets). But the existing code can probably be re-used in parts.
    -- Alternative 2: PR #5: use plain git annex export to push HEAD to the remote. The current code does not perform all the checks that _push_data implements. There might be no need for the checks, or there might be a need for more checks and logic, for example w.r.t. subdatasets. That depends on the position of _push_data in the datalad push logic, which I am not familiar with.

  • Ensure that the new credential system is used.

  • Add a siblings-create-webdav that will take a name-base and create two siblings: <name-base>-tree, which is a webdav special remote with export-tree set to yes, and <name-base>-vcs, which is a datalad-annex::-remote which points to the same location as the <name-base>-tree remote. This is done in PR #6

Authentication and credentials:

Replace `create-sibling --storage-sibling`

Status quo

  --storage-sibling {yes|export|only|only-export|no}
                        Both Git history and file content can be hosted on WebDAV. With 'yes', a storage sibling and a Git repository sibling are
                        created ('yes'). Alternatively, creation of the storage sibling can be disabled ('no'), or a storage sibling can be created
                        only and no Git sibling ('only'). The storage sibling can be set up as a standard git-annex special remote that is capable
                        of storing any number of file versions, using a content hash based file tree ('yes'|'only'), or as an export-type special
                        remote, that can only store a single file version corresponding to one unique state of the dataset, but using a human-
                        readable data organization on the WebDAV remote that matches the file tree of the dataset ('export'|'only-export').
                        When a storage sibling and a regular sibling are created, a publication dependency on the storage sibling is configured for
                        the regular sibling in the local dataset clone. Constraints: value must be one of ('yes', 'export', 'only', 'only-export',
                        'no') [Default: 'yes']

Problems:

  • docs and labels rely on the behavior implied by yes to be the default that needs no explanation, or an explanation that requires technical knowledge
  • the anticipated standard usage is --storage-sibling export

Proposal

--mode {annex|filetree|annex-only|filetree-only|git-only}

Advantages

  • Similar to OSF (albeit not the same, because that one uses exportonly)
  • All alternatives are equally completely specified with their labels (default switching is easier)

Add `datalad.clone.url-substitute.webdav(s)`

This is a mechanism provided by the core package. We could use it to make the specification of webdav-based datalad-annex:: URLs more convenient.

We could automatically turn

webdav://example.com/path

into

datalad-annex::?type=webdav&encryption=none&url=http://example.com/path

which is much less cryptic to enter (the analog for https could be webdavs://)
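A sketch of what such a substitution could look like as user configuration (the pattern is illustrative; in the url-substitute syntax the first character of the value serves as the delimiter between the match and replacement expressions):

git config --global datalad.clone.url-substitute.webdav \
  ',^webdav://(.*)$,datalad-annex::?type=webdav&encryption=none&url=http://\1'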

Possibly too much output when cloning from webdavs://

I created a sibling with create-sibling-webdav. I then tried to clone it into another location as follows:

datalad clone webdavs://fz-juelich.sciebo.de/remote.php/dav/files/m.szczepanik%2540fz-juelich.de/test-datalad-nxt nxxt2
[INFO   ] Remote origin uses a protocol not supported by git-annex; setting annex-ignore                                                      
[INFO   ] Set both WEBDAV_USERNAME and WEBDAV_PASSWORD to use webdav                                                                          
[INFO   ] Need to configure webdav username and password.
|   CallStack (from HasCallStack):
|     error, called at ./Remote/WebDAV.hs:323:21 in main:Remote.WebDAV 
install(ok): /home/mszczepanik/Documents/test-webdav/nxxt2 (dataset)

I think the INFO messages may be carried over from development; the CallStack error looks slightly worrying - but the resulting clone looks good.

Edit: That was at the end of my attempts. I think I exported with explicit credentials (no realm) as described in #48, but in the meantime acquired credentials with a realm (from another create-sibling command run to see what's going on).

Annex mode siblings can't deal with credential change

Context: git-annex caches the username & password in a read-only plaintext file, .git/annex/creds/<annex-uuid>. I wanted to check what would happen if DataLad credentials got updated (whether credentials coming from DataLad take precedence).

What I did:

  • created two siblings (dav1 using filetree mode, dav2 using export mode) on Sciebo using create_sibling_webdav
  • removed Sciebo token used previously, and generated a new one
  • updated the credentials stored by DataLad with datalad credentials set (...) :secret

What I observed:

  • Cloning / getting from either sibling works.
  • Pushing from the original dataset to the filetree sibling works
  • Pushing from the original dataset to the annex sibling fails with the following error
Push to 'dav2':  25%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž                                                            | 1.00/4.00 [00:00<00:00, 10.3k Steps/s]
CommandError: 'git -c diff.ignoreSubmodules=none annex copy --batch -z --to dav2-storage --fast --json --json-error-messages --json-progress -c annex.dotfiles=true' failed with exitcode 1 under /home/mszczepanik/Documents/test-next/example [info keys: stdout_json]
> to dav2-storage...                                                                                                                          
  DAV failure: Status {statusCode = 401, statusMessage = "Unauthorized"} "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<d:error xmlns:d=\"DAV:\" xmlns:s=\"http://sabredav.org/ns\">\n  <s:exception>Sabre\\DAV\\Exception\\NotAuthenticated</s:exception>\n  <s:message>No public access to this resource., Username or password was incorrect, No 'Authorization: Bearer' header found. Either the client didn't send one, or the server is mis-configured, Username or password was incorrect</s:message>\n</d:error>\n" HTTP request: "PUT" "/remote.php/dav/files/m.szczepanik%40fz-juelich.de/webdav-walkthrough/annex-mode/git-annex-webdav-tmp-MD5E-s7576199--5c6bc89919f5e45cc9e7081fa3aac472.jpg"
  DAV failure: Status {statusCode = 401, statusMessage = "Unauthorized"} "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<d:error xmlns:d=\"DAV:\" xmlns:s=\"http://sabredav.org/ns\">\n  <s:exception>Sabre\\DAV\\Exception\\NotAuthenticated</s:exception>\n  <s:message>No public access to this resource., Username or password was incorrect, No 'Authorization: Bearer' header found. Either the client didn't send one, or the server is mis-configured, Username or password was incorrect</s:message>\n</d:error>\n" HTTP request: "PUT" "/remote.php/dav/files/m.szczepanik%40fz-juelich.de/webdav-walkthrough/annex-mode/git-annex-webdav-tmp-MD5E-s7576199--5c6bc89919f5e45cc9e7081fa3aac472.jpg"
  This could have failed because --fast is enabled.
copy: 1 failed

I am wondering what could be the cause, given that the exporttree seems to be the only difference.

GOGS test instance broken

It fails to delete a testrepo in a finally -- this likely means that the underlying test did not even manage to create it before.

======================================================================
ERROR: datalad_next.patches.tests.test_create_sibling_ghlike.test_gogs
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/appveyor/venv3.8.12/lib/python3.8/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/Users/appveyor/venv3.8.12/lib/python3.8/site-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/Users/appveyor/venv3.8.12/lib/python3.8/site-packages/datalad/tests/utils.py", line 194, in _wrap_skip_if_no_network
    return func(*args, **kwargs)
  File "/Users/appveyor/venv3.8.12/lib/python3.8/site-packages/datalad/tests/utils.py", line 874, in _wrap_with_tempfile
    return t(*(arg + (filename,)), **kw)
  File "/Users/appveyor/venv3.8.12/lib/python3.8/site-packages/datalad/distributed/tests/test_create_sibling_gogs.py", line 35, in test_gogs
    check4real(
  File "/Users/appveyor/venv3.8.12/lib/python3.8/site-packages/datalad/distributed/tests/test_create_sibling_ghlike.py", line 183, in check4real
    requests.delete(
  File "/Users/appveyor/venv3.8.12/lib/python3.8/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://try.gogs.io/api/v1/repos/dataladtester/dltst_gogsctqqdqjs

And indeed, the underlying error is

[{'action': 'create_sibling_gogs',
  'message': 'initRepository: init repository: mkdir '
             '/home/git/gogs-repositories/dataladtester/dltst_gogs4lmgpppn.git: '
             'no space left on device',
  'refds': '/tmp/datalad_temp_test_gogs4lmgpppn',
  'status': 'error'}]

I verified that there are no stale repositories lying around in our tester account. It seems that the machine ran out of space.

I will disable the test for now by disabling the credential on AppVeyor, causing the test to auto-skip.

Pull in `Constraint` upgrades from `-gooey`

Gooey is accumulating a load of updated and enhanced Constraint implementations.
https://github.com/datalad/datalad-gooey/blob/main/datalad_gooey/constraints.py

Ultimately they could be adopted in part by datalad-core, see datalad/datalad#7054. However, until this happens, it would be useful to pull them into -next, as they can improve input handling in commands quite a bit. In particular, they enable a standard and homogeneous (re)validation of input arguments against a dataset context -- once it is known.

What would be a more precise re-evaluation for a CLI command call is, in most cases, the very first input validation for a Python API call.

Pacify datalad when creating webdav siblings

Here is what it looks like right now:

% datalad create-sibling-webdav 'https://fz-juelich.sciebo.de/remote.php/dav/files/m.hanke%40fz-juelich.de/dummy'
create_sibling_webdav.storage(ok): /tmp/dummy (dataset)
[INFO   ] Configure additional publication dependency on "fz-juelich.sciebo.de-storage" 
[INFO   ] Could not enable annex remote fz-juelich.sciebo.de. This is expected if fz-juelich.sciebo.de is a pure Git remote, or happens if it is not accessible. 
[WARNING] Could not detect whether fz-juelich.sciebo.de carries an annex. If fz-juelich.sciebo.de is a pure Git remote, this is expected. Remote was marked by annex as annex-ignore. Edit .git/config to reset if you think that was done by mistake due to absent connection etc 
...

This looks like someone poked into a hornet's nest, but except for the first message (which is remotely informative), this is all expected, and any action taken by a user will likely break something that is working.

The message Could not enable annex remote fz-juelich.sciebo.de is siblings trying to enable a "pure Git remote"; it appears whether or not the remote is actually accessible -- and in the standard case we just talked to it, hence we know that it works -- and we will still see that message.

The subsequent WARNING is even more annoying, because we know for a fact that there is no annex, and that there is no mistake or absent connection.

Get ZENODO DOI for releases

I would like to be able to cite this software. Could we create a CITATION.cff file or make a release on Zenodo to DOI-ify it? Then I can properly reference it in a proceedings contribution that is due on July 15th. Thanks!

Tests incompatible with running outside repository

I am running this in my HOME, which is not a Git repo:

======================================================================
ERROR: datalad_next.tests.test_create_sibling_webdav.test_check_existing_siblings
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mih/env/datalad-dev/lib/python3.9/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/mih/hacking/datalad/next/datalad_next/tests/test_create_sibling_webdav.py", line 298, in test_check_existing_siblings
    ds = require_dataset(
  File "/home/mih/hacking/datalad/git/datalad/distribution/dataset.py", line 578, in require_dataset
    raise NoDatasetFound(
datalad.support.exceptions.NoDatasetFound: No dataset found at '/home/mih' for the purpose 'create WebDAV sibling(s)'.  Specify a dataset to work with by providing its path via the `dataset` option, or change the current working directory to be in a dataset.

======================================================================
FAIL: datalad_next.tests.test_create_sibling_webdav.test_http_warning
----------------------------------------------------------------------
datalad.support.exceptions.NoDatasetFound: No dataset found at '/home/mih' for the purpose 'create WebDAV sibling(s)'.  Specify a dataset to work with by providing its path via the `dataset` option, or change the current working directory to be in a dataset.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mih/env/datalad-dev/lib/python3.9/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/mih/hacking/datalad/next/datalad_next/tests/test_create_sibling_webdav.py", line 193, in test_http_warning
    assert_raises_regexp(
AssertionError: "No suitable credential for http://localhost:33322/abc found or specified" does not match "No dataset found at '/home/mih' for the purpose 'create WebDAV sibling(s)'.  Specify a dataset to work with by providing its path via the `dataset` option, or change the current working directory to be in a dataset."

======================================================================
FAIL: datalad_next.tests.test_create_sibling_webdav.test_credential_handling
----------------------------------------------------------------------
datalad.support.exceptions.NoDatasetFound: No dataset found at '/home/mih' for the purpose 'create WebDAV sibling(s)'.  Specify a dataset to work with by providing its path via the `dataset` option, or change the current working directory to be in a dataset.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mih/env/datalad-dev/lib/python3.9/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/mih/hacking/datalad/next/datalad_next/tests/test_create_sibling_webdav.py", line 228, in test_credential_handling
    assert_raises_regexp(
AssertionError: "No suitable credential for https://localhost:22334/abc found or specified" does not match "No dataset found at '/home/mih' for the purpose 'create WebDAV sibling(s)'.  Specify a dataset to work with by providing its path via the `dataset` option, or change the current working directory to be in a dataset."

Example WebDAV workflow

The basic WebDAV workflow (publish - consume with just webdav) would look as follows:

  • datalad create-sibling-webdav to create two siblings (git & storage)
  • set a publication dependency so that gitrepo depends on storage (might be implicit in the above, but totally ok if done explicitly)
  • datalad push --to there
  • from another location, datalad clone (the git sibling)
  • datalad siblings enable storage
  • datalad get files to complete the transfer

My previous write-up of the process with mihextras and git-annex is here
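Put together as commands, a minimal sketch (sibling names follow the create-sibling-webdav default of <name> plus <name>-storage; URLs are placeholders):

# publish
datalad create-sibling-webdav -s mysibling '<WEBDAV_URL>'
datalad push --to mysibling

# consume, from another location
datalad clone '<GIT_SIBLING_URL>' myclone
cd myclone
datalad siblings -d . enable -s mysibling-storage
datalad get .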

How to do fast-forward checking for `git annex export`?

Context

Before we do a git annex export we attempt to verify that the exported treeish (usually HEAD) is a fast-forward of a treeish that was already exported to the same remote.

That usually means that there is a parent-list from the treeish that should be exported to the treeish that was exported.

If there are no sub-datasets involved, we verify that in a simple way. We execute git log --pretty="%T" HEAD, which prints the trees of all commits that are ancestors of HEAD. We can then simply check whether the exported treeish is in that list.
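A minimal sketch of that check as a shell test (EXPORTED_TREE standing in for the treeish recorded in git-annex's export log):

# does the tree of any commit reachable from HEAD match the exported tree?
if git log --pretty=%T HEAD | grep -qxF "$EXPORTED_TREE"; then
    echo "fast-forward: OK"
fi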

Problem

If we export a treeish that contains commit entries, i.e. if the dataset contains sub-datasets, git annex will not record the hash of that treeish in its git-annex:export.log, but the hash of the treeish with the commit entries removed.

For example, if HEAD points to a commit with the following tree (hash: 17f0b7cc0832d642d5cb0948efb8f24d6a40930f):

100644 blob 59157cbb56d57fbc0404856778418fdafe2787e5    .datalad/.gitattributes
100644 blob afe5c41394650c36066c10c2e7ac7f9eef173796    .datalad/config
100644 blob af926ef0c359556ac1d36d71f7e173d97b893ff2    .gitattributes
100644 blob 1a57577ff4989f6524a6e1c7f099c8fc3c73ff78    .gitmodules
120000 blob fb1c6ae80773fc672ac81ea7a7988415eb69daa9    file1.txt
120000 blob 368ea635b9f8f468ed5a78b7dc3aa9ef846f02d3    file2.txt
160000 commit 8630b3ddedca63def995acbb2cfd598ffb9b789d  sub1

git annex will record the hash of the following tree (hash: 0ab91416cb5b34f011a830280bcf547d851b749b):

100644 blob 59157cbb56d57fbc0404856778418fdafe2787e5    .datalad/.gitattributes
100644 blob afe5c41394650c36066c10c2e7ac7f9eef173796    .datalad/config
100644 blob af926ef0c359556ac1d36d71f7e173d97b893ff2    .gitattributes
100644 blob 1a57577ff4989f6524a6e1c7f099c8fc3c73ff78    .gitmodules
120000 blob fb1c6ae80773fc672ac81ea7a7988415eb69daa9    file1.txt
120000 blob 368ea635b9f8f468ed5a78b7dc3aa9ef846f02d3    file2.txt

That results in two problems:

  1. Our simple approach no longer works to verify the fast-forward condition. Instead, we would have to "clean" all trees of commit entries, compute the corresponding hashes, and check those.
  2. Since sub-dataset states are not preserved in git-annex's export records, we cannot reliably verify the fast-forward requirement based on the export logs only, because they would not reflect changes in subdatasets

Solutions:

  1. Do not check the fast-forward requirement
  2. Verify the fast-forward requirement based on "cleaned" trees, ignoring sub-dataset changes
  3. Record the "correct" treeish ourselves and use that for fast-forward verification

Intermediate steps:

  • Remove fast-forwad validation

Develop replacement of `Constraint` system

Development targets

  • Assemble all code in a dedicated subpackage (datalad(_next).constraints), comprised of a module for the base and "meta" classes, a module for the primitives (constraints for basic data types), and a module for specialized ones (e.g. EnsureDatasetSiblingName)
  • Extend base class with API for context switching. See what -gooey has done with Constraint.for_dataset()
  • Beyond that, extend "meta" constraints to pass for_dataset() tailoring down to contained constraints.
  • Support creation of a complex compound constraint from a json schema specification that is able to validate a complete form record. The is analog to what -gooey has been done for building a complete constraint for a parameter from a -core Parameter declaration
    https://github.com/datalad/datalad-gooey/blob/c6d59c24f5104693e1e8183b596b41fb32f4fc1c/datalad_gooey/param_form_utils.py#L167-L247
    Ideally, both sets of functionality become accessible via static methods of the Constraint class, i.e. from_parameter(), from_json_schema() datalad/datalad-gooey#393
  • Address datalad/datalad-gooey#311 and thereby have a constraint that can validation a complete command parameterization
  • Implement a "context switching" EnsureMapping() variant that can take another Constraint like EnsureMappings, and internally switch its context, once a value of type Dataset is known to a declare mapping key. Such a constraints can then be used to perform a full context-aware validation of a command parametrization. The base command handling in -core could then be changed to perform any and all validation before a command is called. This would substantially slim implementations of particular commands, homogenize behavior and reduce boilerplate code. Moreover, any command could also declare such an additional, command-level constraint that perform inter-parameter validation, specific for a command (e.g. to ensure or prevent particular parameter (value) combination, e.g. mutual exclusion etc...
  • Allow for CLI generation from a Constraint. -gooey can already auto-generate a GUI for a command based on a single Constraint (per Parameter, ATM, but see above). This shows that the same should be possible for the CLI, and thereby replace the near-duplicate handling in core's CLI module.
  • Add documentation support. ATM the self-documentation possible in constraint types is rather minimal (short|long_description()), We should add support for declaring custom documentation via a constrain constructor. This would take and replace the role of Parameter(doc=)
  • Expose documentation of inter-parameter value validation via generated command help.
  • Switch to structured error reporting datalad/datalad-gooey#394. Compound constraints make it necessary to more precisely identify which component led to an error. Like we need to create a dedicated exception (derived from ValueError) that provide this kind of inference. With it, it should be possible to annotate a specific input field in a form with the respective validation error that cause a single compound validation to fail. We likely need to implement support for some kind of "trace" identifier that can point to an item to a complex conjunction of mappings and sequences. Something like siblings.4.name
  • Addtional language support for value validation. If we auto-generate an HTML form from a constraint, it would be good to support something like javascript-based validation code that could be injected directly into the form for online validation (at least of primitives). This need not live in the Constraint, but it should be possible without major pain.
  • Comprehensive tests
  • Comprehensive documentation: design document and usage
  • Remove support for expand_constraint_spec() which can only handle a handful of data types, and the proposed from_parameter() will handle this, and many other cases. The only place this is used is https://github.com/datalad/datalad/blob/b3ff763c8f1eff6ec6c21c837dd4c37136944fe0/datalad/support/param.py#L76
  • Rewrite EnsureNone based on a new EnsureValue, which can cover any default.
  • Possibly optimize EnsureParameterConstraint.from_parameter() to amend a choice list of an EnsureChoice rather than | Ensure Value to yield a leaner constraint. UNclear yet, if that would have undesirable side-effects

Create sibling webdav asks for credential realm (if it gets one that has no realm specified)

I created a credential to hold my Sciebo token:

datalad credentials set mysciebotoken type=user_password [email protected]

Then I used it with create-sibling-webdav:

datalad create-sibling-webdav -d . -s sciebo --mode filetree --credential mysciebotoken "https://fz-juelich.sciebo.de/remote.php/dav/files/m.szczepanik%40fz-juelich.de/test-datalad-nxt"

User name and password are required for WebDAV access at https://fz-juelich.sciebo.de/remote.php/dav/files/m.szczepanik%40fz-juelich.de/test-datalad-nxt
realm: 

I left the realm empty (hit enter) and everything seemed to work well afterwards. At first I wasn't sure what the realm means (docs for the credential command mention "realm" but don't explain it). Then I noticed that when a new credential is set during create-sibling-webdav, it gets the realm "https://fz-juelich.sciebo.de/sciebo", which determines that it will be used on subsequent clones from that location.

I'm not sure if the prompt about realm is by design.

New feature `record-reproducible` -> `remake`

Underlying goal

Be able to automatically recompute individual file content on demand, when confirmed reproducible.

This is related to datalad/datalad#2850, but with more focus on how to encode a "reproducibility record", and on doing so in a way that is more efficient than one commit message per reproducible unit.

Exploration of the problem space

Confirming reproducibility implies computing at least twice.

The two computations need not necessarily be identical, e.g. an expensive initial computation might yield all pieces to recompute individual content for a (subset of) file(s) much cheaper on a 2nd run.

The run command seems to have all the machinery to implement this feature (but see below how this first impression is not entirely true); it only needs to be amended with a new last processing step that can record the fact that a command reproduced a declared output, and did NOT modify the dataset. Implementing this in run makes this feature easily accessible for containers-run, too.

Given that in general no new dataset state will be committed, the resulting additional run-record cannot live in a commit message. We already have sidecar-based records as a possible alternative. Their names are content-based, hence will not change unless the conditions of the execution change. Such conditions need to be written down in a design document datalad/datalad#6100.

Given that fine grained reproducibility for individual files is desirable, one dedicated run-record for each file is not optimal, due to the large number of record files it would generate. We do have the ability to use custom placeholders already. Their expansion may need to be extended to input and output specs too. This type of expansion would only be necessary in the proposed --record-reproducible mode, and there is no need to alter the semantics or composition of standard run-records. This is also an approach to address datalad/datalad#3075

When such run records need to be expanded per file, the respective parameter value must be stored somewhere. It seems to make sense to annotate the respective annex keys that can be recomputed with this information.

This makes it possible to implement an annex special remote, that can act on this information, and "retrieve" file content via recomputation (which would resolve datalad/datalad#2850).
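Purely for illustration, and not an existing interface: an invocation of the proposed mode could eventually look something like the following (the --record-reproducible flag is the hypothetical proposal from this issue, with {placeholder} expansion extended to input/output specs as discussed above):

# hypothetical invocation of the proposed feature
datalad run --record-reproducible \
  --input 'raw/{subj}.dat' --output 'out/{subj}.res' \
  'compute {inputs} {outputs}'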

For each key, the following information must be stored:

  • (A) reference commit defining the worktree state required for a recomputation

    This would in general be the commit the reproducibility test was conducted for. Recording the commit would also allow the special remote implementation to verify whether for other HEAD commits the exact same versions of all declared inputs still match the reproducibility record.

  • (B) identifier of the reproducibility record template

    The template can be committed to a dataset before the annex keys are annotated. This would make the template part of the recorded commit, and enable straightforward retrieval. Alternatively, we could require this template to be an annex'ed file, and just store its annex key.

  • (C) list of parameters for expanding all placeholders in the template

This information could be encoded in a URL that a respective special remote could claim, or the SETSTATE feature could be used to directly store the information, e.g., JSON-encoded.

(B) could be omitted when (A) is a commit with a run-record.

Implementation notes

Given the actual capabilities and quirks of the current implementation the following things require attention:

  • Given that a reproducibility test will not involve the generation of new output files, we must rely on declared outputs to be able to know which keys to annotate at the end
  • The reproducibility test must guarantee that the recorded process actually generates all declared outputs -- this is presently not always the case datalad/datalad#6441.
  • In order to guarantee that, all declared outputs must be removed prior command execution, regardless of their availability
  • For this be be possible in the general case, a reproducibility check must be immediately rejected whenever the expanded inputs overlap with the expanded output declarations.
  • They key annotation must take place after the dataset state was saved, and after it could be verified that, at most, a sidecar files was created (see datalad/datalad#6437 for the difficulties of doing this prior saving)
  • It is OK to record the commit after a sidecar record was saved, and it simplifies lookup/verification of necessary components for a re-compute. In the worst case, a full checkout of this commit can be done recursively to re-compute a key
  • Implementing this feature within run not only makes it accessible to containers-run, but also enables an less complex implementation before datalad/datalad#6438 is resolved.
  • A partially populated dataset will prevent the input declarations from being validity-tested. The documentation should state that.
  • When producing the run-record, inputs and other placeholders must be stored pristine (non-expanded/subtituted)
  • The annex key annotation must include all effective values for placeholders at the time of the reproducibility test. Whether or not inputs/outputs need to be fully expanded, can likely be indicated via the existing --expand argument

TODO

  • Clarify and test details of a parameterization of a reproducibility record. What needs to be expanded, what must not be expanded?
  • Think about declaring multiple means to produce a specific content (eg, on different platforms, with different tools).

git-annex > 10.20220504 breaks `git-clone` from `datalad-annex::...exporttree=yes` (any special remote type)

TL;DR:

git-annex now forcefully and unconditionally distrusts exporttree=yes special remotes. This makes it seemingly impossible for the datalad-annex:: remote helper implementation to use annex fsck to probe a remote for a specific key.


Mainline fails https://ci.appveyor.com/project/mih/datalad-next/build/job/98jegq2gaw7i04rg

eq_(ds.repo.get_hexsha(ds.repo.get_corresponding_branch()),
AssertionError: 'b1c9aef80331c775dc06372d5cb2536543a15455' != None

Possibly a git version issue. Does not replicate for me locally.

Update: Nope, git version in both cases is 2.36.1

Looking into the failure of datalad_next.tests.test_create_sibling_webdav.test_common_workflow_export specifically: this is near-identical to datalad_next.tests.test_create_sibling_webdav.test_common_workflow_explicit_cred (which passes), but uses the filetree mode for the webdav sibling. The dataset clone the respective check is failing on indeed contains no commits.

Rerunning the clone call does not produce any messages.

Manually looking at the repository deposited via push on webdav (and the underlying filesystem location) reveals no problem -- looks good with all expected content/branches/commits present.

So the issue is clone from a filetree-mode deposit.

Appveyor uses a git-annex snapshot git-annex version: 10.20220505-g2df856370, locally I have git-annex version: 10.20220504

Deploying this snapshot locally also triggers the failure. This points fair and square to a behavior change in git-annex past 10.20220504.

Git-annex did indeed do two export-related changes since that release:

  * Special remotes with importtree=yes or exporttree=yes are once again
    treated as untrusted, since files stored in them can be deleted or
    modified at any time.
    (Fixes a reversion in 8.20201129)
  * Special remotes using exporttree=yes and/or importtree=yes now
    checksum content while it is being retrieved, instead of in a separate
    pass at the end.

The breakage is also detected by the more direct test of the datalad-annex remote helper: datalad_next.gitremote.tests.test_datalad_annex.test_export_remote also failing to locate any content in a clone, immediately after cloning. Here is a demonstrator:

% datalad create --no-annex myds
% cd myds
% git remote add dla 'datalad-annex::file:///tmp/myremote?type=directory&directory={path}&encryption=none&exporttree=yes'
% git push -u dla

# seems all good, remote deposit looks good

% cd ..
% git clone 'datalad-annex::file:///tmp/myremote?type=directory&directory={path}&encryption=none&exporttree=yes' myclone
Cloning into 'myclone'...
warning: You appear to have cloned an empty repository.

# looking at the internals
% git -C .git/dl-repoannex/origin/mirrorrepo log
fatal: your current branch 'master' does not have any commits yet

Diving into the code, it seems that the remote helper fails to detect a deposit at the remote. This leads to an empty local mirror repository. The immediate cause of this misdetection is

[DEBUG] Finished ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'fsck', '-f', 'origin', '--fast', '--key', 'XDLRA--refs', '--debug', '-c', 'annex.dotfiles=true'] with status 1 

It exits with code 1, indicating that there is no such key.

Protocol of a debugging detour that led nowhere:

Running the equivalent command again reports what is to be expected: success.
% git -C myclone/.git/dl-repoannex/origin/repoannex annex fsck -f origin --fast --key XDLRA--refs
fsck XDLRA--refs ok
(recording state in git...)

This discrepancy does not appear to be caused by a race condition. Adding a 30s wait before the command does not change the result. At the same time, immediately running the command line equivalent always succeeds.

Matching the command line call and the internal execution more closely with

% git -C myclone/.git/dl-repoannex/origin/repoannex -c diff.ignoreSubmodules=none annex fsck -f origin --fast --key XDLRA--refs --debug -c annex.dotfiles=true

does not make a difference, also succeeds.

Conclusion: The reason why I did not see that outside that session was simply using an older git-annex in the other shell :(

Going deeper, here is what that fsck call yields internally. I reformatted the output to make things a bit clearer:

[DEBUG] Finished ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'fsck', '-f', 'origin', '--fast', '--key', 'XDLRA--refs', '--debug', '-c', 'annex.dotfiles=true'] with status 1 
[Level 8] CommandError: 'git -c diff.ignoreSubmodules=none annex fsck -f origin --fast --key XDLRA--refs --debug -c annex.dotfiles=true' failed with exitcode 1 under /home/mih/hacking/datalad/next/tmp/myclone/.git/dl-repoannex/origin/repoannex [out: 'fsck XDLRA--refs (fixing location log) 

  (Avoid this check by running: git annex dead --key )
failed
(recording state in git...)'] [err: '....']

The err output I cut from the listing is elaborate and is primarily the XDLRA backend process talking:

Importantly, it contains the output:
  Only these untrusted locations may have copies of XDLRA--refs
        0cc19d93-2885-44c8-83ba-30c2dd3afe40 -- [origin]
  Back it up to trusted locations with git-annex copy.

which directly relates to a recent change in git-annex.

```
[2022-06-09 12:58:40.104765007] (Utility.Process) process [522424] read: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","show-ref","git-annex"]
[2022-06-09 12:58:40.107326093] (Utility.Process) process [522424] done ExitSuccess
[2022-06-09 12:58:40.107800842] (Utility.Process) process [522425] read: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","show-ref","--hash","refs/heads/git-annex"]
[2022-06-09 12:58:40.110214871] (Utility.Process) process [522425] done ExitSuccess
[2022-06-09 12:58:40.110383183] (Utility.Process) process [522426] read: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","log","refs/heads/git-annex..7fef236de205c1dedd65230dbf54b62eacb50922","--pretty=%H","-n1"]
[2022-06-09 12:58:40.112980831] (Utility.Process) process [522426] done ExitSuccess
[2022-06-09 12:58:40.117993862] (Utility.Process) process [522427] chat: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","hash-object","-w","--stdin-paths","--no-filters"]
[2022-06-09 12:58:40.1183213] (Utility.Process) process [522428] chat: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","cat-file","--batch"]
[2022-06-09 12:58:40.118508819] (Utility.Process) process [522429] feed: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","update-index","-z","--index-info"]
[2022-06-09 12:58:40.118696224] (Utility.Process) process [522430] read: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","diff-index","--raw","-z","-r","--no-renames","-l0","--cached","refs/heads/git-annex","--"]
[2022-06-09 12:58:40.121375579] (Utility.Process) process [522430] done ExitSuccess
[2022-06-09 12:58:40.121699292] (Utility.Process) process [522429] done ExitSuccess
[2022-06-09 12:58:40.122057043] (Annex.Branch) read remote.log
[2022-06-09 12:58:40.122743897] (Annex.Branch) read trust.log
[2022-06-09 12:58:40.122928676] (Annex.Branch) read numcopies.log
[2022-06-09 12:58:40.123825612] (Utility.Process) process [522431] chat: /home/mih/env/datalad-dev/bin/git-annex-backend-XDLRA []
[2022-06-09 12:58:40.123895201] (Annex.ExternalAddonProcess) /home/mih/env/datalad-dev/bin/git-annex-backend-XDLRA[1] <-- GETVERSION
[Level 5] Instantiating ssh manager
[Level 5] Done importing main __init__
[Level 5] Importing support.network
[Level 5] Done importing support.network
[Level 5] Requested to provide version for cmd:git
[Level 9] Updated GIT_DIR to /home/mih/hacking/datalad/next/tmp/myclone/.git/dl-repoannex/origin/repoannex
[DEBUG] Run ['git', 'version'] (cwd=None)
[Level 8] Process 522441 started
[Level 5] STDERR: git version (Thread<(STDERR: git version, 5)>) started
[Level 5] STDOUT: git version (Thread<(STDOUT: git version, 3)>) started
[Level 5] process_waiter (Thread) started
[Level 5] Read 19 bytes from 522441[stdout]
[Level 5] STDOUT: git version (Thread<(STDOUT: git version, 3)>) exiting (exit_requested: False, last data: None)
[Level 5] process_waiter (Thread) exiting
[Level 5] STDERR: git version (Thread<(STDERR: git version, 5)>) exiting (exit_requested: False, last data: None)
[Level 8] Process 522441 exited with return code 0
[DEBUG] Finished ['git', 'version'] with status 0
[Level 8] No module named 'datalad_deprecated' [gitrepo.py::3737]
[DEBUG] Not retro-fitting GitRepo with deprecated symbols, datalad-deprecated package not found
[DEBUG] Apply datalad-next patch to annexrepo.py:AnnexRepo.enable_remote
[Level 5] Importing dataset
[Level 5] Done importing dataset
[Level 8] No module named 'requests_ftp' [http.py::62]
[DEBUG] Failed to import requests_ftp, thus no ftp support: ModuleNotFoundError(No module named 'requests_ftp')
[Level 5] Importing datalad.interface.results
[Level 5] Done importing datalad.interface.results
[DEBUG] Apply datalad-next patch to create_sibling_ghlike.py:_GitHubLike._set_request_headers
[DEBUG] Building doc for
[DEBUG] Building doc for
[DEBUG] Building doc for
[DEBUG] Patching datalad.core.distributed.push._transfer_data
[DEBUG] Patching datalad.core.distributed.push.Push docstring and parameters
[DEBUG] Building doc for
[DEBUG] Patching datalad.support.AnnexRepo.get_export_records (new method)
[DEBUG] Patching datalad.core.distributed.push._push
[DEBUG] Building doc for
[DEBUG] Building doc for
[DEBUG] Apply datalad-next patch to siblings.py:_enable_remote
[2022-06-09 12:58:40.452792385] (Annex.ExternalAddonProcess) /home/mih/env/datalad-dev/bin/git-annex-backend-XDLRA[1] --> VERSION 1
[2022-06-09 12:58:40.452952653] (Annex.ExternalAddonProcess) /home/mih/env/datalad-dev/bin/git-annex-backend-XDLRA[1] <-- CANVERIFY
[2022-06-09 12:58:40.453240061] (Annex.ExternalAddonProcess) /home/mih/env/datalad-dev/bin/git-annex-backend-XDLRA[1] --> CANVERIFY-YES
[2022-06-09 12:58:40.453281514] (Annex.ExternalAddonProcess) /home/mih/env/datalad-dev/bin/git-annex-backend-XDLRA[1] <-- ISSTABLE
[2022-06-09 12:58:40.453345299] (Annex.ExternalAddonProcess) /home/mih/env/datalad-dev/bin/git-annex-backend-XDLRA[1] --> ISSTABLE-NO
[2022-06-09 12:58:40.453373478] (Annex.ExternalAddonProcess) /home/mih/env/datalad-dev/bin/git-annex-backend-XDLRA[1] <-- ISCRYPTOGRAPHICALLYSECURE
[2022-06-09 12:58:40.453512855] (Annex.ExternalAddonProcess) /home/mih/env/datalad-dev/bin/git-annex-backend-XDLRA[1] --> ISCRYPTOGRAPHICALLYSECURE-NO
[2022-06-09 12:58:40.471669769] (Annex.Branch) read export.log
[2022-06-09 12:58:40.472150955] (Utility.Process) process [522450] read: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","diff-tree","-z","--raw","--no-renames","-l0","-r","4b825dc642cb6eb9a060e54bf8d69288fbee4904","7f0e7953e93b4c9920c2bff9534773394f3a5762","--"]
[2022-06-09 12:58:40.476549694] (Utility.Process) process [522451] chat: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
[2022-06-09 12:58:40.479317776] (Utility.Process) process [522452] chat: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","cat-file","--batch"]
[2022-06-09 12:58:40.481855753] (Utility.Process) process [522450] done ExitSuccess
[2022-06-09 12:58:40.486404384] (Annex.Branch) read 3f7/4a3/XDLRA--refs.log
[2022-06-09 12:58:40.48672456] (Annex.Branch) read 3f7/4a3/XDLRA--refs.log
[2022-06-09 12:58:40.487012582] (Annex.Branch) set 3f7/4a3/XDLRA--refs.log
[2022-06-09 12:58:40.48704974] (Annex.Branch) read 3f7/4a3/XDLRA--refs.log
[2022-06-09 12:58:40.487141518] (Annex.Branch) read uuid.log
  Only these untrusted locations may have copies of XDLRA--refs
        0cc19d93-2885-44c8-83ba-30c2dd3afe40 -- [origin]
  Back it up to trusted locations with git-annex copy.
[2022-06-09 12:58:40.487435721] (Annex.Branch) read activity.log
[2022-06-09 12:58:40.487713107] (Annex.Branch) set activity.log
[2022-06-09 12:58:40.488282595] (Utility.Process) process [522453] feed: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","update-index","-z","--index-info"]
[2022-06-09 12:58:40.492162749] (Utility.Process) process [522453] done ExitSuccess
[2022-06-09 12:58:40.492545997] (Utility.Process) process [522454] read: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","show-ref","--hash","refs/heads/git-annex"]
[2022-06-09 12:58:40.495228312] (Utility.Process) process [522454] done ExitSuccess
[2022-06-09 12:58:40.495630629] (Utility.Process) process [522455] read: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","write-tree"]
[2022-06-09 12:58:40.500163285] (Utility.Process) process [522455] done ExitSuccess
[2022-06-09 12:58:40.500551135] (Utility.Process) process [522456] chat: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","commit-tree","4c77d0557cd40fc90b148e4c37609b08862b13a2","--no-gpg-sign","-p","refs/heads/git-annex"]
[2022-06-09 12:58:40.503226831] (Utility.Process) process [522456] done ExitSuccess
[2022-06-09 12:58:40.503454986] (Utility.Process) process [522457] call: git ["--git-dir=.","--literal-pathspecs","-c","annex.debug=true","-c","annex.dotfiles=true","update-ref","refs/heads/git-annex","8ae7bfb27c860bab05cee763f7a9684d3b40840c"]
[2022-06-09 12:58:40.506732789] (Utility.Process) process [522457] done ExitSuccess
[2022-06-09 12:58:40.507288254] (Utility.Process) process [522452] done ExitSuccess
[2022-06-09 12:58:40.50747011] (Utility.Process) process [522428] done ExitSuccess
[2022-06-09 12:58:40.507759388] (Utility.Process) process [522451] done ExitSuccess
[2022-06-09 12:58:40.508061665] (Utility.Process) process [522427] done ExitSuccess
fsck: 1 failed
```
% git -C myclone/.git/dl-repoannex/origin/repoannex -c diff.ignoreSubmodules=none annex fsck -f origin --fast --key XDLRA--refs -c annex.dotfiles=true
fsck XDLRA--refs 
  Only these untrusted locations may have copies of XDLRA--refs
        95652ad0-00d9-4df5-8f57-c6393772d540 -- [origin]
  Back it up to trusted locations with git-annex copy.

  (Avoid this check by running: git annex dead --key )
failed
(recording state in git...)
fsck: 1 failed

It is impossible to override this trust assessment:

% git -C myclone/.git/dl-repoannex/origin/repoannex -c diff.ignoreSubmodules=none annex trust origin --force
trust origin 
  This remote's trust level is overridden to untrusted.
ok
(recording state in git...)

The recommended way to disable the check does not work:

% git -C myclone/.git/dl-repoannex/origin/repoannex -c diff.ignoreSubmodules=none annex dead --key XDLRA--refs                                       
dead XDLRA--refs 
git-annex: This key is still known to be present in some locations; not marking as dead.
failed
dead: 1 failed

`push` to export fails with old git-annex

Quick notes from seeing it live in a screenshare. git-annex was 8.20200617. push (as patched by -next) failed at

    result_adjusted = \
        annexjson2result(result, ds, **res_kwargs)

because that version of git-annex emitted a JSON record with {..., "file": None} (unclear whether it was a real None or the string "None").

The patched push triggered it, but this might really be a bug in annexjson2result() from core... Although, given

    if d.get('file'):
        res['path'] = str(ds.pathobj / PurePosixPath(d['file']))

the failing code (i.e. the Path / Path concatenation) should only be reached when there is a non-empty file value, which points at the string "None" interpretation.
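To make the distinction concrete (a minimal standalone sketch, not the actual push code):

    from pathlib import Path, PurePosixPath

    ds_path = Path('/tmp/ds')  # stand-in for ds.pathobj

    d = {'file': None}    # a real None is filtered out by the guard
    if d.get('file'):
        p = ds_path / PurePosixPath(d['file'])  # never reached

    d = {'file': 'None'}  # the *string* "None" passes the guard
    if d.get('file'):
        p = ds_path / PurePosixPath(d['file'])  # silently yields /tmp/ds/None

So a real None should be harmless at this spot, whereas the string "None" would silently produce a bogus path.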

Upgrading to a recent git-annex changed its behavior and no longer triggers the issue.

@mslw may have additional info.

Given that we have little chance to debug this, we might want to close this issue and instead emit a warning when such an old git-annex version is used in that code path. The datalad-annex:: code might also impose an even stricter version requirement that deserves such a check.
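A minimal sketch of such a guard (the cut-off version, logger name, and message are assumptions, not decided values; external_versions is DataLad's existing version-inspection helper):

    import logging
    from datalad.support.external_versions import external_versions

    lgr = logging.getLogger('datalad.next')  # hypothetical logger name

    # hypothetical cut-off; the actual minimal version would need to be determined
    if external_versions['cmd:annex'] < '8.20210101':
        lgr.warning(
            'git-annex %s is too old for reliable JSON result reporting, '
            'please upgrade', external_versions['cmd:annex'])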

Consolidate matching siblings-finding between create_sibling_ria and create_sibling_webdav

create_sibling_ria has code to determine matching siblings. create_sibling_webdav currently has a function that does something similar, i.e. _yield_ds_w_matching_siblings.

Once the function is perfected, the implementations should be consolidated. That means the function should probably move to core and be used by both sibling implementations.

(This issue replaces a TODO in create_sibling_webdav.py.)
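For reference, a consolidated helper could look roughly like this (a sketch under assumed semantics; the actual _yield_ds_w_matching_siblings() in create_sibling_webdav.py may differ):

    from datalad.distribution.dataset import Dataset

    def yield_ds_with_matching_siblings(ds, names, recursive=False):
        """Yield (dataset, sibling_name) for every sibling named in `names`."""
        candidates = [ds]
        if recursive:
            candidates.extend(
                Dataset(p) for p in ds.subdatasets(result_xfm='paths'))
        for d in candidates:
            for res in d.siblings(action='query', result_renderer='disabled'):
                if res.get('name') in names:
                    yield d, res['name']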

Documentation snippet on datalad-annex:: use with S3 remote

In this case, git-annex leaves a config file with the uuid of the special remote configuration in the bucket. In normal operation, the uuid used by datalad-annex:: will always be different, because the entire setup is throw-away. This causes git-annex to refuse to operate.

This can be circumvented by setting an explicit uuid= parameter in the special remote configuration (via the datalad-annex:: URL). Any random, but constant UUID will make things work.
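For example (the bucket name and UUID below are arbitrary placeholders; the fixed uuid= parameter is the relevant part):

% datalad clone 'datalad-annex::?type=S3&bucket=mybucket&encryption=none&uuid=c8a4f12e-1d6a-4b7e-9f0e-7f3b1d2a5c42' myclone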

This should be added to the documentation.

Alternatively, the implementation could be changed to always use a constant UUID for any connection. However, I am not convinced that this solves more problems than it creates, because it would render such protections (against accidental (re-)use via two different setups) ineffective.

`credentials` command should get a `cleanup` mode

In some cases, secrets will get removed from an underlying credential store. This leaves the remaining credential properties in the datalad config around, without any immediate purpose.

I think it would be nice to implement a cleanup mode that wipes out all such credential configs.
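A hypothetical invocation (neither the mode name nor its exact behavior are settled yet):

% datalad credentials cleanup

This would drop leftover property entries, e.g. a datalad.credential.<name>.user setting whose corresponding secret no longer exists in the underlying store.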

Exception not handled correctly when pushing to exporttree remote on newer Git / git-annex versions

I found it after upgrading git-annex with datalad-installer (related to #61), but I believe it is related only to the Git version, which the installer also upgraded.

With Git 2.36.0, git-annex 10.20220504, and datalad-next 0.2.2, pushing to an exporttree remote (created with create_sibling_webdav --mode filetree) produced:

CommandError: 'git -c diff.ignoreSubmodules=none cat-file blob git-annex:export.log' failed with exitcode 128
fatal: path 'export.log' does not exist in 'git-annex'

This is because the error message produced by git cat-file changed somewhere between Git 2.30 ("fatal: Not a valid object name git-annex:export.log") and 2.36 ("fatal: path 'export.log' does not exist in 'git-annex'").

The code which handles the error and checks its message is here:

    try:
        for line in repo.call_git_items_(["cat-file", "blob", "git-annex:export.log"]):
            result_dict = dict(zip(
                [
                    "timestamp",
                    "source-annex-uuid",
                    "destination-annex-uuid",
                    "treeish"
                ],
                line.replace(":", " ").split()
            ))
            result_dict["timestamp"] = float(result_dict["timestamp"][:-1])
            yield result_dict
    except CommandError as command_error:
        # Some errors indicate that there was no export yet.
        expected_errors = (
            "fatal: Not a valid object name git-annex:export.log",
        )
        if command_error.stderr.strip() in expected_errors:
            return
        raise

I'll post a fix (adding the new message) soon.
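The fix will presumably just extend the tuple of expected messages (a sketch of the announced change; both messages are quoted above):

    expected_errors = (
        # wording up to at least Git 2.30
        "fatal: Not a valid object name git-annex:export.log",
        # wording in Git 2.36
        "fatal: path 'export.log' does not exist in 'git-annex'",
    )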

Push to webdav sibling did not create export tree, despite exporttree=yes

This is a user report from @arefks; similar to #110 but this time in a simpler situation (without reconfiguring).

A repository was configured with text2git, and had several text files, including a subdirectory. A webdav sibling was created using create-sibling-webdav. A push operation created .datalad and .datalad/dotgit, but no file tree on Sciebo.

Although there were no annexed files, the tree should have been created regardless by git-annex-export:

Any files in the treeish that are stored on git will also be exported to the special remote.
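That is, for such an exporttree=yes remote, the push should effectively have resulted in the equivalent of the following (using the storage sibling name from the configuration shown below):

% git annex export HEAD --to sciebo-storage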

The git-annex remote configuration on the pushing machine looks fine:

> git cat-file blob git-annex:remote.log
ca8906c7-41dd-4f25-abc2-7a2a448c9f01 encryption=none exporttree=yes name=sciebo-storage type=webdav url=<redacted> timestamp=1665654839.8830946s

but the export.log was not created:

> git cat-file blob git-annex:export.log
fatal: path 'export.log' does not exist in 'git-annex'

git-annex metadata based data dictionaries

Those are essentially mappings from a set of keys/terms to URIs defining the underlying concepts.

This is compatible and suitable for git-annex metadata. The big advantage of using this instead of an explicit file is collaboration.

Merging a contribution into a JSON-based declaration is a nightmare when conflicts arise; git-annex merges distributed metadata changes just fine.
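As an illustration (the field name and URI are made up), a term-to-concept mapping can be recorded directly on a file with git-annex:

% git annex metadata --set 'concept_uri=https://example.org/terms/age' participants.tsv

Concurrent additions from different clones are then merged via the git-annex branch without manual conflict resolution.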

`create_sibling_webdav` must render sibling name somehow

Right now it is rendered nowhere, except in the warnings (#30):

add-sibling(ok): . (sibling)

The source material is not very rich:

% datalad -f json_pp create-sibling-webdav 'https://fz-juelich.sciebo.de/remote.php/dav/files/m.hanke%40fz-juelich.de/dummies/2'
{
  "action": "create_sibling_webdav.storage",
  "path": "/tmp/restest",
  "status": "ok",
  "type": "dataset"
}
{
  "action": "add-sibling",
  "annex-ignore": true,
  "datalad-publish-depends": "fz-juelich.sciebo.de-storage",
  "fetch": "+refs/heads/*:refs/remotes/fz-juelich.sciebo.de/*",
  "name": "fz-juelich.sciebo.de",
  "path": "/tmp/restest",
  "refds": "/tmp/restest",
  "status": "ok",
  "type": "sibling",
  "url": "datalad-annex::?type=webdav&encryption=none&exporttree=no&url=https%3A//fz-juelich.sciebo.de/remote.php/dav/files/m.hanke%2540fz-juelich.de/dummies/2"
}

Only the second one has a name.
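One way to address this could be a custom result renderer on the command that appends the sibling name when a result record carries one (a sketch only, using datalad's custom_result_renderer hook; not an agreed-upon design):

    from datalad.ui import ui

    # would live on the create-sibling-webdav Interface class
    @staticmethod
    def custom_result_renderer(res, **kwargs):
        # render the usual 'action(status): path' line,
        # but append the sibling name if the record has one
        ui.message('{action}({status}): {path}{name}'.format(
            action=res.get('action', ''),
            status=res.get('status', ''),
            path=res.get('path', ''),
            name=' [{}]'.format(res['name']) if 'name' in res else '',
        ))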
