
datalad-dataverse's Introduction

DataLad extension for working with Dataverse


Dataverse is open source research data repository software that is deployed all over the world as data or metadata repositories. It supports sharing, preserving, citing, exploring, and analyzing research data with descriptive metadata, and thus contributes greatly to open, reproducible, and FAIR science. DataLad, on the other hand, is a data management and data publication tool built on Git and git-annex. Its core data structure, DataLad datasets, can version control files of any size, and streamline data sharing, updating, and collaboration. This DataLad extension package provides interoperability with Dataverse to support dataset transport to and from Dataverse instances.

Installation

# create and enter a new virtual environment (optional)
$ virtualenv --python=python3 ~/env/dl-dataverse
$ . ~/env/dl-dataverse/bin/activate
# install from PyPI
$ python -m pip install datalad-dataverse

How to use

Additional commands provided by this extension are immediately available after installation. However, in order to fully benefit from all improvements, the extension has to be enabled for auto-loading by executing:

git config --global --add datalad.extensions.load dataverse

Doing so will enable the extension to also alter the behavior of the core DataLad package and its commands, for example to enable cloning directly from a Dataverse dataset landing page.

Full compatibility with Windows requires a git-annex installation of version 10.20230321 (or later).

Summary of functionality provided by this extension

  • Interoperability between DataLad and Dataverse version 5 (or later).
  • An add-sibling-dataverse command to register a Dataverse dataset as a remote sibling of a DataLad dataset.
  • A git-annex-remote-dataverse special remote implementation for storage and retrieval of data in a Dataverse dataset via git-annex.
  • These two features combined enable the deposition and retrieval of complete DataLad datasets on Dataverse, including version history and metadata. A direct datalad clone from a Dataverse dataset landing page is supported, and yields a fully functional DataLad dataset clone (Git repository).
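As a quick illustration, cloning such a deposit from Python could look like the sketch below; the instance URL and DOI are placeholders, and the datalad-annex:: URL follows the scheme used by the dataverse special remote (see the examples further down). With the extension enabled for auto-loading, cloning directly from the dataset landing-page URL should work as well.

# Minimal sketch, assuming a DataLad dataset was previously deposited on a
# Dataverse instance; the base URL and DOI below are placeholders.
import datalad.api as dl

ds = dl.clone(
    # a 'datalad-annex::' URL addressing the dataverse special remote
    source='datalad-annex::?type=external&externaltype=dataverse'
           '&url=https%3A//demo.dataverse.org&doi=doi%3A10.5072/FK2/EXAMPLE'
           '&encryption=none',
    path='my-dataset',
)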

Contributors ✨

Thanks goes to these wonderful people (emoji key):

  • Johanna Bayer: 📖
  • Nadine Spychala: 🚇 📖
  • Benjamin Poldrack: 🚇 💻 📖 🚧 👀 🤔 🔧
  • Adina Wagner: 💻 🤔 🚇 📖 🚧 👀
  • Michael Hanke: 💻 🤔 🚧 🚇 👀 🔧
  • enicolaisen: 📖
  • Roza: 📖
  • Kelvin Sarink: 💻
  • Jan Ernsting: 💻
  • Chris Markiewicz: 💻
  • Alex Waite: 🚇 💻 🚧 🔧
  • Shammi270787: 💻
  • Wu Jianxiao: 💻 👀 📓
  • Laura Waite: 📖
  • Michał Szczepanik: 🚇
  • Benedikt Ehinger: 🐛 🚧

This project follows the all-contributors specification. Contributions of any kind welcome!

Acknowledgements

This DataLad extension was developed with support from the German Federal Ministry of Education and Research (BMBF 01GQ1905), the US National Science Foundation (NSF 1912266), the Helmholtz research center Jülich (RDM challenge 2022), and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant SFB 1451 (431549029, INF project).


datalad-dataverse's Issues

Draft a create-sibling command

We need to figure out how this needs to look in principle.
For example: creating a collection and the dataset (and its subdatasets) within vs. using an existing one. I guess both should be possible. However, using an existing one may imply that we can't put anything in the collection's metadata. It may need a different mode of operation rather than just a "create or not".

check if filename validation is needed and if so, write helper

During manual dataset creation, "upload via the web interface" allows specifying a "file path", a "hierarchical directory structure path used to display file organization and support reproducibility". An uploaded file will be placed under the given file path. E.g., uploading "myfile.txt" and supplying "this/is/a/path/" results in the downloaded dataset being a zip file with the directory tree this/is/a/path/myfile.txt (this doesn't seem to affect its visualization on dataverse, where it is a flat list of files). This file path has character restrictions:

Directory Name cannot contain invalid characters. Valid characters are a-Z, 0-9, '_', '-', '.', '', '/' and ' ' (white space).

If we make use of these paths, we need a helper to ensure only valid characters are used. File names seem to be fine with non-valid characters, e.g., I was able to publish a file with an "ü" in the demo instance.
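A minimal sketch of what such a helper could look like, based solely on the error message quoted above; the function names and the strategy of replacing disallowed characters with '_' are assumptions for illustration, not existing datalad-dataverse code:

import re

# character set taken from the Dataverse error message quoted above
# (a backslash is included here as an assumption about the garbled quote)
_VALID_DIRNAME = re.compile(r'^[a-zA-Z0-9_\-./\\ ]*$')

def is_valid_dataverse_dirname(path):
    """Return True if `path` only contains characters Dataverse allows."""
    return bool(_VALID_DIRNAME.match(path))

def sanitize_dataverse_dirname(path, replacement='_'):
    """Replace disallowed characters so `path` can be used as a Dataverse
    'file path' (directory label)."""
    return re.sub(r'[^a-zA-Z0-9_\-./\\ ]', replacement, path)

assert is_valid_dataverse_dirname('this/is/a/path')
assert sanitize_dataverse_dirname('äöü/data') == '___/data'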

Set up all-contributors bot to credit contributions

We previously used @all-contributors, a nice and almost fully automatic way of acknowledging contributions in open source projects.

I propose to set it up for this project as well. Documentation on how to do it is here, but it may require the right set of permissions, so whoever takes this issue on, please get in touch if you need anything. :)

Implement test for (export) of identical files under different names

This should be a non-issue for a normal (non-export) upload -- deduplication is inherent with the annex-key setup. But #18 suggests that complications could arise from a dataset/verse containing multiple redundant copies of the same file under different names.

We should have a test for that.

JuelichData quirks

As the integration with JülichData (see #15) is a driving motivation behind this project, let's assemble a list of supported or unsupported features and other peculiarities we might need to keep in mind.

Jülich Data's mission statement is:

Jülich DATA primary use case is serving as a registry* for scholarly data.

  • The DOIs of published datasets registered with Jülich DATA may point to external repositories, or point to the landing page on Jülich DATA.
  • The landing page on Jülich DATA gives free and open access to all bibliographic and/or scientific metadata of the respective dataset. All terms of licencing still have to be fulfilled.

* A special type of data repository, storing bibliographic and scientific metadata, but not the data itself. Instead, the data is only linked to from metadata fields. Hybrid forms are possible. Jülich DATA allows for depositing data, but focus is on registry.

Beyond this, their organization docs state:

Every staff member of Research Centre Jülich is (automatically) allowed to add datasets to the Campus Collection, but cannot publish it standalone. Entries are curated by the central library.
Please note: a director might decide you may not add data to the Campus Collection. Rouge entries will be either moved or deleted by its curators.

The consequences arising from this are:

  • Data upload seems possible via HTTP (limited to 2GB per file), but publication of the uploaded files requires review by the central library team.
  • Institutional or project-level dataverses have to be created with confirming permissions. Trying to create a new dataverse instance results in an XML parsing error for normal users. The docs say "decentralized data managers" are allowed to do that.
  • Datasets at root level of the dataverse installation are not allowed. Trying to click "Add data" under your user account without selecting a collection causes an XML parsing error for normal users.
  • New datasets can be created in the "Campus" collection https://data.fz-juelich.de/dataverse/campus; they are created in draft mode, and publication requires review by the central library
  • Metadata-wise, there is an additional "FZJ Metadata" block with drop-down menus "institute" and "PoV IV topic"

Programmatically determine list of dataset "subjects"

Calling pyDataverse.api.NativeApi.create_dataset() requires specifying "subjects". This is not optional, and it cannot be an arbitrary string.

The WebUI exposes a predefined list of identifiers


We have to be able to query this list programmatically in order to give meaningful advice for composing this mandatory dataset metadata.
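A sketch of how such a query might look, using the citation metadata block endpoint; whether this endpoint actually exposes the controlled vocabulary values may depend on the Dataverse version, so treat this as a starting point rather than a confirmed solution:

import requests

BASE_URL = 'https://demo.dataverse.org'  # placeholder instance

# the 'subject' field is defined in the 'citation' metadata block
resp = requests.get(f'{BASE_URL}/api/metadatablocks/citation')
resp.raise_for_status()
fields = resp.json()['data']['fields']
subject = fields.get('subject', {})

# if the block definition carries the vocabulary, it would show up here;
# otherwise the list may have to come from the citation.tsv block definition
print(subject.get('controlledVocabularyValues',
                  'vocabulary not exposed by this endpoint'))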

Set up a Zenodo.json or CITATION.cff file

Once we have made a release to PyPI, we will archive the release on Zenodo for preservation and citability. This requires one of two files:

a) Either a zenodo.json file. A minimal example is already in this repo. It would need to contain the names of all contributors as well as some minimal metadata (a useful example could be the zenodo.json file from datalad-osf)
b) Or a CITATION.cff file. This is a new feature from GitHub, and just a few days ago, Zenodo announced support for it. An example is in the datalad repository. If we go for CITATION.cff, we need to remove the zenodo.json file from this repository, else Zenodo will pick that one up instead.

Documenting experience with Dataverse v4 issues

In case this could prove useful, I am writing down a few notes from my previous experience with creating a collection + dataset using an installation of Dataverse 4 in 2020/2021, on dataverse.nl. This is the dataset in question: https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/R1TNL8

Issue 1

One main issue that I encountered was with uploading the full BIDS dataset via the web portal's HTTP upload interface. I received a vague error message like "This file already exists" with no reference to the duplicate files. I had to inspect this manually in the file list (and also via the API). I found the files in question to be missing, and when trying to upload them again, received the same error. This was then found to be a known issue in dataverse. Here's just a quick selection of closed issues that highlight the progression of these challenges:

The workaround at that point was to upload the files in compressed format. I don't know why this worked, but it did (after many tries).

Issue 2

The compressed-files upload approach also "solved" the other issue that I had, which was that the HTTP uploader couldn't handle more than ~1000 files at a time. However, to be sure that I could make use of the tree view of files in the dataset, I had to compress my subset of files into the same directory structure that it would occupy in the final tree. I didn't have to explicitly specify the relative path of a file with extra metadata. I suspect once dataverse processed the compressed file after upload, there was some internal mapping to the path field in file-specific metadata.

Here is a link to information on the dataverse user docs regarding compressed files: https://guides.dataverse.org/en/5.10.1/user/dataset-management.html#compressed-files

Issue 3

Another issue was bulk download of the whole dataset from the webportal, which resulted in a 414 Request-URI Too Long error. Here's a related issue:

The suggestion then was to wait for the upgrade to dataverse v5 to address this issue. I waited for a long time, but perhaps they have (hopefully) already addressed this issue and completed the upgrade.

Issue 4

Lastly, I also encountered an issue with the guestbook functionality, which is meant to get people to agree to some data use terms digitally before download. It was possible to circumvent the guestbook via the API. As a workaround I had to restrict all files in the dataset, which required manual access granting whenever someone requested a download. This is of course the opposite of what one wants to achieve with the guestbook functionality. The guestbook functionality and issue are described here:

Create CONTRIBUTING.md

It would be good to have a CONTRIBUTING.md file added to this repository, similar to the one in the datalad-osf repo.
We should migrate the instructions there, from cloning and Python environment setup to how to set up a dataverse instance from scratch with Docker.

Action items from Juelich's research data management challenge

One of the main motivations for this project was a commitment to an internal "Research data management challenge" of the Research Center Juelich, in which we aim to provide integration between the imaging core facility data acquisition site and the dataverse instance of the Jülich central library, "Jülich Data" (data.fz-juelich.de).
Therefore, the workpackages in the proposal are a source of project aims for this hackathon. Here are excerpts from the introduction and workpackage 3 that are relevant, with emphasis added by me:

"and INM data output will become more findable, accessible, interoperable, and reusable (FAIR) via standardization and integration with Jülich DATA"
WP3: [...] We will partner with the FZJ Central Library to develop a metadata schema and workflow to enable INM-ICF users to programmatically register a dataset with the Jülich Data portal. [...] Based on previous work on the interoperability of this solution with key services, such as the OpenScience Framework (http://docs.datalad.org/projects/osf), we will provide INM-ICF users with software that can create and populate a dataset record on Jülich Data. [...] As a use case for demonstrating the to-be-developed metadata schema and workflow applied to a heterogeneous dataset, we will register the Jülich Dual-Tasking & Aging study, comprising a host of different types of MRI, behavioral and self-report data, with the Jülich DATA portal. Within this project metadata reporting will be restricted to anonymous information, such as acquisition parameters, QC metrics, general study descriptions.

I extract the following out of it:

Software features

  • Create a dataverse dataset inside of a dataverse collection (see #14 for a delineation of the terms)
  • Perform a dataset export to the dataverse dataset via their supported upload protocols. There are several supported protocols, but not all of them might be enabled for a given dataverse installation.

Metadata

  • Check the metadata schemas in dataverse. There are three levels of metadata: Citation Metadata (any metadata that would be needed for generating a data citation and other general metadata that could be applied to any dataset); Domain Specific Metadata (with specific support currently for Social Science, Life Science, Geospatial, and Astronomy datasets); and File-level Metadata (varies depending on the type of data file)
  • Check for feasibility of metadata extractors for the available schemas

Usecase/Examples

  • Take a look at the DTA dataset and check how easily the required anonymous information can be extracted
  • Figure out how this metadata can best be mapped to available dataverse metadata standards
  • Find out which version of dataverse is run at the central library, and what the supported features are (who can create what, which data upload protocols are enabled, ...).
  • Make a list of features the Jülich dataverse has or doesn't have, and check that the concepts in #14 apply to it

Overview of resources

(This needs to be migrated into the README before the hackathon, but I'll create it as an issue just to collect info.)

There is a demo dataverse instance at https://demo.dataverse.org. You need to sign up, but it's fast and complication-free. Instead of searching for an institution, sign up with your email address and a username of your choice.
The instance allows:

  • Creating new dataverses
  • Creating new datasets

The API guide is at https://guides.dataverse.org/en/5.10.1/api. Of particular interest may be the section https://guides.dataverse.org/en/5.10.1/api/intro.html#developers-of-integrations-external-tools-and-apps, which is about third-party integrations. Among other things, it mentions https://pydataverse.readthedocs.io/en/latest, a Python library for accessing the Dataverse APIs and for manipulating and using Dataverse (meta)data - Dataverses, Datasets, Datafiles. The Open Science Framework also has a dataverse integration: https://github.com/CenterForOpenScience/osf.io/tree/develop/addons/dataverse.
@jsheunis also found: https://guides.dataverse.org/en/latest/admin/metadatacustomization.html

Determine how a datalad-clone from dataverse would work (user perspective)

The URL for a dataset shown in the browser is something like

http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/CHHQWH&version=DRAFT

The same page is pulled with the URL

http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/CHHQWH

This seems relatively specific (enough for dataverse) and pretty easy to map to the actual datalad-annex URL we could immediately clone from:

datalad-annex::?type=external&externaltype=dataverse&url=http%3A//localhost%3A8080&doi=doi%3A10.5072/FK2/CHHQWH&encryption=none

DataLad has a mapping mechanism for that. Here is an example from the -next extension

register_config(
    'datalad.clone.url-substitute.webdav',
    'webdav(s):// clone URL substitution',
    description="Convenience conversion of custom WebDAV URLs to "
    "git-cloneable 'datalad-annex::'-type URLs. The 'webdav://' "
    "prefix implies a remote sibling in 'filetree' or 'export' mode "
    "See https://docs.datalad.org/design/url_substitution.html for details",
    dialog='question',
    scope='global',
    default=(
        r',^webdav([s]*)://([^?]+)$,datalad-annex::http\1://\2?type=webdav&encryption=none&exporttree=yes&url={noquery}',
    ),
)

Here are the full docs for this feature: http://docs.datalad.org/en/latest/design/url_substitution.html
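For illustration only, the URL rewriting itself could be expressed as a small helper like the one below; this is not the shipped substitution rule, just a demonstration that the mapping is mechanical:

from urllib.parse import parse_qs, quote, urlsplit

def landing_page_to_datalad_annex(url):
    """Turn .../dataset.xhtml?persistentId=doi:... into a cloneable
    'datalad-annex::' URL following the pattern shown above."""
    parts = urlsplit(url)
    pid = parse_qs(parts.query)['persistentId'][0]
    base = f'{parts.scheme}://{parts.netloc}'
    return (
        'datalad-annex::?type=external&externaltype=dataverse'
        f'&url={quote(base, safe="/")}&doi={quote(pid, safe="/")}'
        '&encryption=none'
    )

print(landing_page_to_datalad_annex(
    'http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/CHHQWH'))
# -> datalad-annex::?type=external&externaltype=dataverse&url=http%3A//localhost%3A8080&doi=doi%3A10.5072/FK2/CHHQWH&encryption=none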

Interaction with Dataverse guestbook feature via datalad

A dataverse dataset can have a (customizable) guestbook feature which allows dataset owners to add a set of questions that people have to answer before they can download a specific file or set of files.

As an example, see this data from FZJ on dataverse: https://data.fz-juelich.de/dataset.xhtml?persistentId=doi:10.26165/JUELICH-DATA/T1PKNZ

When selecting "Download" on a file, a popup appears that has to be completed first:


The content of this guestbook can be customized by the dataset owner. It can for example include a link to a data usage agreement and a checkbox that the user has to tick to say that they agree to the DUA. This could be a very useful automated functionality to have for situations where data can be openly shared with the caveat that a log of who accesses it (and that they agreed to the terms) needs to be kept.

However, there are some inefficiencies w.r.t. web-portal vs API access of dataverse data that I have experienced before, but I will have to dig up the details. The summary of the issues I had:

  • the guestbook has to be filled in for every download, which is inefficient if the user will be downloading many files manually
  • bulk download is possible via the webportal, but there was a cap on the number of files. Datasets with many files could therefore not easily be downloaded in bulk via the portal
  • download via the API was possible, but it was also possible to circumvent the guestbook feature this way

This was for a specific major version of the API (I think 4, but don't bet on that). Hopefully this is not the case anymore, but some of these issues might rear their heads again.

Whatever role datalad can play in this domain to make agreeing to the guestbook requirements more seamless would be useful IMO.

Some other links:

Implement credential retrieval for create-sibling-dataverse

Right now, this just gets the token from the test environment.
Proper implementation should rely on CredentialManager from datalad-next and consider all the respective configurations.

In create_sibling_dataverse.py there is a to-be-implemented function _get_api_token for that.
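A rough sketch of what that function could look like; the environment-variable fallback mirrors the current test setup, while the CredentialManager import path and call signature are assumptions based on the datalad-next documentation and still need to be verified:

import os

def _get_api_token(ds, url):
    # current behaviour: the token comes from the (test) environment
    token = os.environ.get('DATAVERSE_API_TOKEN')
    if token:
        return token
    # intended behaviour: consult datalad-next's credential system, scoped to
    # the dataverse instance URL (assumed import path and call signature)
    from datalad_next.credman import CredentialManager
    credman = CredentialManager(ds.config)
    cred = credman.get(name=None, realm=url, type='token')
    if cred and cred.get('secret'):
        return cred['secret']
    raise RuntimeError(f'no Dataverse API token found for {url}')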

Import data to dataverse

I used this script to import data to dataverse:

export API_TOKEN=**********
export SERVER_URL=https://demo.dataverse.org
export DATAVERSE_ID=root
export PERSISTENT_IDENTIFIER=doi.org/******

curl -H "X-Dataverse-key:$API_TOKEN" -X POST \
  "$SERVER_URL/api/dataverses/$DATAVERSE_ID/datasets/:import?pid=$PERSISTENT_IDENTIFIER&release=yes" \
  --upload-file dataset.json

Bootstrap an empty dataverse instance with a dataset

This could become a test helper:

from pyDataverse.api import NativeApi
# dataverse base URL and admin token
api = NativeApi('http://localhost:8080', '5367f732-36bd-46ed-975a-9eeaeb98fe74')

# metadata for a dataverse (collection)
from pyDataverse.models import Dataverse
dvmeta = Dataverse(dict(
    name="myname",
    alias="myalias",
    dataverseContacts=[dict(contactEmail='[email protected]')]
))
# create under the 'root' collection
api.create_dataverse('root', dvmeta.json()).text

# metadata for a dataset in a collection
from pyDataverse.models import Dataset
dsmeta = Dataset(dict(
    title='mytitle',
    author=[dict(authorName='myname')],
    datasetContact=[dict(
        datasetContactEmail='[email protected]',
        datasetContactName='myname')],
    dsDescription=[dict(dsDescriptionValue='mydescription')],
    subject=['Medicine, Health and Life Sciences']
))
# create dataset in the just-created dataverse (collection)
api.create_dataset('myalias', dsmeta.json()).text

Robustify DOI specification for `init|enableremote`

% DATAVERSE_API_TOKEN=5367f732-36bd-46ed-975a-9eeaeb98fe74  git annex initremote dv1 encryption=none type=external externaltype=dataverse url=http://localhost:8080 doi=10.5072/FK2/0NLYPQ
initremote dv1 
git-annex: external special remote error: ERROR: GET HTTP 404 - http://localhost:8080/api/v1/datasets/:persistentId/?persistentId=10.5072/FK2/0NLYPQ. MSG: {"status":"ERROR","message":"Dataset with Persistent ID 10.5072/FK2/0NLYPQ not found."}
failed
initremote: 1 failed

% DATAVERSE_API_TOKEN=5367f732-36bd-46ed-975a-9eeaeb98fe74  git annex initremote dv1 encryption=none type=external externaltype=dataverse url=http://localhost:8080 doi=doi:10.5072/FK2/0NLYPQ
initremote dv1 ok
(recording state in git...)

I think it makes sense to detect a full URL or a plain DOI string and format it correctly as doi:<doistring> for all input styles.
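A small sketch of such a normalization helper; the function is hypothetical and only illustrates the proposed behaviour:

import re

def normalize_doi(spec):
    """Accept 'doi:10.xxxx/...', a bare '10.xxxx/...' string, or a
    https://doi.org/ URL and return the canonical 'doi:...' form."""
    spec = spec.strip()
    # strip resolver URLs such as https://doi.org/10.5072/FK2/0NLYPQ
    m = re.match(r'^https?://(?:dx\.)?doi\.org/(.+)$', spec)
    if m:
        spec = m.group(1)
    # drop an existing 'doi:' prefix, then re-add it uniformly
    spec = re.sub(r'^doi:', '', spec, flags=re.IGNORECASE)
    if not re.match(r'^10\.\d+/', spec):
        raise ValueError(f'does not look like a DOI: {spec!r}')
    return f'doi:{spec}'

assert normalize_doi('10.5072/FK2/0NLYPQ') == 'doi:10.5072/FK2/0NLYPQ'
assert normalize_doi('doi:10.5072/FK2/0NLYPQ') == 'doi:10.5072/FK2/0NLYPQ'
assert normalize_doi('https://doi.org/10.5072/FK2/0NLYPQ') == 'doi:10.5072/FK2/0NLYPQ'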

Needed: default dataset description

Whenever no custom dataset description is available or provided, we still need to provide one to dataverse. The default look of a dataverse dataset that has had a datalad dataset's Git repo pushed to it is this:


It would make sense to anchor the default description for this mode on explaining what a datalad dataset is, how one would work with it, and what the nature of these two files is (and also the similar-looking, equally cryptic annex keys) -- i.e. say that this is not meant to be consumed without datalad.

We already have something similar in https://github.com/datalad-datasets/human-connectome-project-openaccess, which could be tailored for the present needs.

Improve error with missing dataset DOI

Current state:

% DATAVERSE_API_TOKEN=5367f732-36bd-46ed-975a-9eeaeb98fe74  git annex initremote dv1 encryption=none type=external externaltype=dataverse url=http://localhost:8080
initremote dv1 
git-annex: external special remote error: ERROR: GET HTTP 404 - http://localhost:8080/api/v1/datasets/:persistentId/?persistentId=. MSG: {"status":"ERROR","message":"Dataset with Persistent ID  not found."}
failed
initremote: 1 failed

Glossary of Dataverse terms and how they map to DataLad concepts

  • Dataverse installation: A running, deployed dataverse instance

  • Dataverse dataset
    A dataset is a container for data, documentation, code and the metadata describing it. A DOI is assigned to each dataset.
    Datasets have three levels of metadata:

    • Citation Metadata: any metadata that would be needed for generating a data citation and other general metadata that could be applied to any dataset;
    • Domain Specific Metadata: with specific support currently for Social Science, Life Science, Geospatial, and Astronomy datasets; and
    • File-level Metadata: varies depending on the type of data file

Datasets are created inside of dataverse collections. Users need to make sure to create datasets only in dataverse collections they have permissions to create datasets in. Data upload supports several methods (HTTP, Dropbox upload, rsync + ssh, command line DVUploader), but not all of those are supported by each Dataverse installation, and only one method can be used for each dataset:

If there are multiple upload options available, then you must choose which one to use for your dataset. A dataset may only use one upload method. Once you upload a file using one of the available upload methods, that method is locked in for that dataset. If you need to switch upload methods for a dataset that already contains files, then please contact Support by clicking on the Support link at the top of the application.

  • Dataverse collection
    A Dataverse collection is a container for datasets and other Dataverse collections. Users can create new Dataverse collections using "Add Data" -> "New dataverse" (they by default become administrator of that Dataverse collection) and can manage its settings
    Dataverse collections can be created for a variety of purposes; the categories the dataverse project uses to describe them are "Researcher", "Organization", or "Institution". Jülich Data has 28 dataverses, which are institutional/project-level dataverses. Those dataverses can only be created with confirming permissions from Jülich Data, but once they exist, they can be organized as the institutes see fit.

  • Dataset linking: A dataverse owner can “link” their dataverse to a dataset that exists outside of that dataverse, so it appears in the dataverse’s list of contents without actually being in that dataverse. One can link other users’ datasets to your dataverse, but that does not transfer editing or other special permissions to you. The linked dataset will still be under the original user’s control.

  • Dataverse linking: Dataverse linking allows a dataverse owner to “link” their dataverse to another dataverse, so the dataverse being linked will appear in the linking dataverse’s list of contents without actually being in that dataverse. Currently, the ability to link a dataverse to another dataverse is a superuser only feature.

  • Publishing a dataverse: Making a created dataverse public, or making datasets public. At least on JuelichData, the very first version of a dataset is called "DRAFT", the first published version is "1.0"

Not specific to dataverse.org

create-sibling-dataverse docs state:

Create a dataset sibling(-tandem) on dataverse.org.

This is misleading; it should work for any dataverse deployment.

Figure how to provide a "docker export"

The current approach to providing a dataverse-docker setup relies on docker-compose pulling the respective images over the network.
For the hackathon, it may not be feasible for everybody to pull several GB over the site's Wi-Fi.

There should be a way to have the setup "exported" to a hard drive or USB stick.

We need to figure out how to do that (and how to use it as an alternative starting point, obviously).

Create a project/package description for the index page of the docs

It would be cool if the index of the docs could be stripped of the generic "datalad extension" content, and be amended with some general overview of the project. Stealing content from the brainhack project pitch description is completely fine, and it would be great if it could mention that the project originates from this hackathon.

Declare software dependencies

This extension will require token handling. It makes little sense to me (@mih) to fiddle with the implementation in datalad-core. Instead we should use the latest features in datalad-next (which implies datalad_next >= 0.2.0; latest is 0.3.0).

For the implementation of tests, it makes sense to me (@mih) to use pytest right from the start. However, a datalad version with the necessary utilities (0.17.0) has not been released yet. This implies an additional dependency on datalad-core's master branch -- beyond the dependency link via datalad-next.

Additional dependencies will possibly come via #7.

Docker hub is unhappy with our test frequency

ERROR: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
Build exited with code 1

We should limit the runs (maybe disable the branch CI run and just keep the PR one). Or follow the advice.

What is Dataverse's data concept?

We need a good grasp of dataverse's notion of dataverses, collections, datasets, and files, and of what data deposition looks like, in order to have an idea of what could be done with respect to a datalad export of sorts.

Determine Dataverse API equivalents for git-annex special remote operations

We need to explore what types of operations can be mapped onto the dataverse API in order to see what functionality can be supported at this most basic level. It would be very attractive to use this layer, because it would automatically enable interoperability with dataverse for file deposition, file retrieval, Git history deposition, datalad-clone from dataverse DOIs, and even usage outside datalad with git-annex directly.

Note that git-annex operates strictly in a per-file-version mode, so things like bulk-uploads of ZIP file etc. are not part of the possibilities.

Here is the list of operations (a skeleton sketch of how they could map onto a special remote implementation follows after the list):

Must have

  • TRANSFER STORE: file upload to deposit under a name matching the annex key (git-annex content identifier)
  • TRANSFER RETRIEVE: file download based on an annex key
  • CHECKPRESENT: report whether a file matching a particular annex key is available. It is important that the implementation does not report presence when in fact an upload is still ongoing and not complete.
  • REMOVE: Delete a file matching a particular annex key

Nice to have

  • SETURLPRESENT: determine a (persistent) URL where file content can be downloaded for a file deposited by TRANSFER STORE
  • TRANSFEREXPORT STORE: Like TRANSFER STORE but store a file under a specific filename on dataverse
  • TRANSFEREXPORT RETRIEVE: Like TRANSFER RETRIEVE, but retrieve file content by a file name given to TRANSFEREXPORT STORE rather than an annex key
  • CHECKPRESENTEXPORT: Like CHECKPRESENT, but check for a filename given to TRANSFEREXPORT STORE
  • REMOVEEXPORT: Like REMOVE, but remove a file by filename given to TRANSFEREXPORT STORE

To make things more efficient/convenient

  • REMOVEEXPORTDIRECTORY: remove all files with "logical" paths (as given to TRANSFEREXPORT STORE) inside a particular directory
  • RENAMEEXPORT: Change the "logical" path of a deposited file

Extremely nice to have

Ability to report a list of files in a dataset with the following information for each file:

  • filename (logical/tree path)
  • size in bytes
  • content identifier: an identifier provided by (or computable from information provided by) dataverse that is
    • stable, so when a file has not changed, its content identifier remains the same
    • changes when a file is modified
    • be as unique as possible, but not necessarily fully unique. A hash of the content would be ideal, but a (size, mtime, inode) tuple is better than nothing
    • be reasonably short (needs to be tracked)

and possibly the same information even for historical versions.
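The "must have" operations above map fairly directly onto the callbacks of the annexremote helper library; the skeleton below is a hedged sketch of that mapping, with all Dataverse-specific calls left as placeholders (nothing here reflects the actual datalad-dataverse implementation):

from annexremote import Master, SpecialRemote, RemoteError


class DataverseRemote(SpecialRemote):

    def initremote(self):
        # validate the url= and doi= settings given to initremote/enableremote
        self.url = self.annex.getconfig('url')
        self.doi = self.annex.getconfig('doi')
        if not self.url or not self.doi:
            raise RemoteError('url= and doi= are required')

    def prepare(self):
        # set up an authenticated Dataverse API client (token handling omitted)
        ...

    def transfer_store(self, key, filename):
        # TRANSFER STORE: upload `filename` under a name matching the annex key
        ...

    def transfer_retrieve(self, key, filename):
        # TRANSFER RETRIEVE: download the content for `key` into `filename`
        ...

    def checkpresent(self, key):
        # CHECKPRESENT: must only report presence for fully completed uploads
        return False

    def remove(self, key):
        # REMOVE: delete the file matching `key` from the Dataverse dataset
        ...


def main():
    master = Master()
    master.LinkRemote(DataverseRemote(master))
    master.Listen()


if __name__ == '__main__':
    main()

The export-oriented operations (TRANSFEREXPORT STORE/RETRIEVE, CHECKPRESENTEXPORT, REMOVEEXPORT, RENAMEEXPORT, ...) would correspondingly be covered by subclassing annexremote's ExportRemote instead.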
