DataLad Dataverse extension

Welcome to the DataLad-Dataverse project of the OHBM 2022 Brainhack!

What do we want to do during this Brainhack?

Dataverse is open source research data repository software that is deployed all over the world in data or metadata repositories. It supports sharing, preserving, citing, exploring, and analyzing research data with descriptive metadata, and thus contributes greatly to open, reproducible, and FAIR science. DataLad, on the other hand, is a data management and data publication tool built on Git and git-annex. Its core data structure, the DataLad dataset, can version control files of any size and streamlines data sharing, updating, and collaboration. In this hackathon project, we aim to make DataLad interoperable with Dataverse to support dataset transport to and from Dataverse instances. To this end, we will build a new DataLad extension, datalad-dataverse, and would be delighted to welcome you on board the contributor team.

Skills

We plan to start from zero with this project and welcome contributions of all kinds, drawing on various skills at any level: from setting up and writing documentation, discussing relevant functionality, or user-experience testing, to Python-based implementation of the desired functionality and creating real-world use cases and workflows. Here is a non-exhaustive list of skills that can be beneficial in this project:

You have used a Dataverse instance before and/or have access to one, or you are interested in using one in the future
You know technical details about Dataverse, such as its API, or would have fun finding out about them
You know Python
You have experience with the Unix command line
You are interested in creating accessible documentation
You are interested in learning about the DataLad ecosystem or the process of creating a DataLad extension
Your secret hobby is Git plumbing
You know git-annex, and/or about its backends
You want to help create metadata extractors for Dataverse to generate dataset metadata automatically

Getting started

Great that you're joining us in this project! Here's a list of things that can help you to prepare or to get started:

Create a GitHub account. Ideally, set up SSH keys following the GitHub docs.
Clone this repository. If you haven't installed Git yet, do that first; the Traintrack installation instructions can help with this.

git clone git@github.com:datalad/datalad-dataverse.git

Install DataLad and its dependencies. The DataLad Handbook has installation instructions for your operating system.
Set up a Python environment. This project is written in Python, and creating a Python development environment is the best preparation to get started right away. There are many ways to set up a virtual environment, and some might suit your operating system or the software you already have installed better than others. The brainhack traintrack corner can show you how to do it with Miniconda. Below, you'll find code snippets showing how the DataLad team usually creates its development environment.

# create a virtual environment (for Linux/MacOS)
virtualenv --python=python3 ~/env/hacking
# activate the virtual environment
source ~/env/hacking/bin/activate
# install datalad-dataverse in its development version
cd datalad-dataverse
pip install -e .

Take a look at the section "Dataverse docker for running tests" to learn how to spin up your own Dataverse instance (if you are on a Linux computer or Mac). Alternatively, or in addition, check out demo.dataverse.org, a free Dataverse installation for testing purposes where you can sign up and experiment.
Check out the Dataverse documentation for an overview of the software, and likewise the DataLad docs. A specialized Dataverse doc link that may be of particular relevance is this section of the API guide, which is about third-party integrations. Among other things, it mentions https://pydataverse.readthedocs.io/en/latest, a Python library for accessing the Dataverse APIs and for manipulating and using Dataverse (meta)data: Dataverses, Datasets, and Datafiles (it will likely become this extension's backend). For metadata, there is also this guide.
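
For the Python environment step above, it can be handy to confirm that the interpreter you are about to install into really belongs to the activated environment. The following is a minimal sketch using only the standard library. The function name in_virtualenv is made up for illustration, and conda environments are not detected by this check.

```python
import sys


def in_virtualenv() -> bool:
    """Return True if the running interpreter belongs to a venv/virtualenv."""
    # Older virtualenv releases set sys.real_prefix inside an environment.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and current virtualenv make sys.prefix point at the environment,
    # while sys.base_prefix keeps pointing at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

Running this with the Python from ~/env/hacking after activation should report an active environment; if it does not, pip install -e . would likely go into the wrong place.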

Contact

The virtual lead for this project (time zone: EMEA) is @bpoldrack. The on-site lead for this project (time zone: Glasgow) is @adswa. The best way to reach us is by tagging us in issues or pull requests. You can find us and our voice channel on Discord.

Dataverse docker for running tests

The dataverse-docker repository provides everything needed to run dataverse in a container. The continuous integration test build is based on that. The setup script used for the respective AppVeyor build is under tools/ci/setup_docker_dataverse.

You can use this docker setup locally on your machine, too, provided it's running Linux and you have docker and docker-compose installed. All involved scripts (from our end as well as the dataverse-docker repo) are completely ignorant of Windows. The basic setup is this:

git clone https://github.com/IQSS/dataverse-docker
cd dataverse-docker
export traefikhost=localhost
docker network create traefik
cp .env_sample .env

If you want to customize your setup, you may want to edit this .env file. Note that the following call to docker-compose relies on being run from the directory that contains the docker-compose file and the .env file. If that's not suitable in your case, docker-compose provides the --file and --env-file parameters to pass their paths. Note, however, that even if you call it with those parameters from elsewhere, the paths specified in both files are relative to their base directory (the cloned repo's root) and refer to further configs underneath it. Hence, the residual directories minio-data and database-data are still created within the repository.

docker-compose up -d

This should give you a running server at http://localhost:8080. Note that it may take a moment for the server to come up. If you go to that address in a browser, you should be able to log in as dataverseAdmin with password admin (which you will be required to change immediately). This allows you to create dataverses, datasets, files, etc. An API token for that user can be found under that user's menu (upper right).
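
Because the server needs a moment to come up, scripts that talk to it may want to wait until it answers. Below is a rough sketch of such a wait loop. The function wait_for_dataverse and its parameters are hypothetical, and it assumes that the instance exposes the /api/info/version endpoint of the Dataverse native API at the address given above.

```python
import time
import urllib.error
import urllib.request


def wait_for_dataverse(base_url="http://localhost:8080", timeout=300.0,
                       interval=5.0, opener=urllib.request.urlopen,
                       sleep=time.sleep):
    """Poll the server until it answers with HTTP 200 or `timeout` elapses.

    `opener` and `sleep` are injectable so the loop can be tried out
    without a running server.
    """
    # Assumption: the version endpoint is available on this instance.
    url = base_url.rstrip("/") + "/api/info/version"
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener(url) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not reachable yet, or still answering with an error status
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A call such as wait_for_dataverse(timeout=600) returns True as soon as the server responds and False if it never does within the given time.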

Apart from the web interface, you should be able to send some basic requests:

$> curl "http://localhost:8080/api/search?q=data"
{"status":"OK","data":{"q":"data","total_count":0,"start":0,"spelling_alternatives":{},"items":[],"count_in_response":0}}
$> curl "http://localhost:8080/api/dataverses/root/contents"
{"status":"ERROR","message":"User :guest is not permitted to perform requested action."}

If you authenticate with the token, the latter response changes:

$> curl -H "X-Dataverse-key: <TOKEN>" "http://localhost:8080/api/dataverses/root/contents"
{"status":"OK","data":[{"type":"dataverse","id":2,"title":"Dataverse Admin Dataverse"}]}
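
The same requests can be sent from Python. The sketch below uses only the standard library rather than pyDataverse. The helper names build_request and fetch_json are invented for illustration, and the base URL is the local docker instance described above.

```python
import json
import urllib.request


def build_request(path, token=None, base_url="http://localhost:8080"):
    """Build (but do not send) a GET request for a Dataverse API path."""
    request = urllib.request.Request(base_url.rstrip("/") + path)
    if token:
        # Same header as in the authenticated curl call above
        request.add_header("X-Dataverse-key", token)
    return request


def fetch_json(request):
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running instance; replace <TOKEN> with a real API token.
    request = build_request("/api/dataverses/root/contents", token="<TOKEN>")
    print(fetch_json(request))
```

Note that urllib raises urllib.error.HTTPError when the server answers with an HTTP error status, so an unauthenticated request that is not permitted may surface as an exception instead of the JSON error message.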

Several additional notes:

The initial docker-compose call could give an error message if you already have an active port mapping. However, this does not necessarily mean you need to change anything; it may work just fine, so try despite such an error message.
This is only the most basic setup. Within the container there are a bunch of setup scripts to initialize the demo version. If you get into the container, you can find them in HOME (/opt/payara/dvinstall). You may want to check out setup-users.sh and setup-dvs.sh to see some basic API requests setting up users and dataverses. setup-dvs.sh doesn't actually run, though, since setup-users.sh doesn't assign pete the required permissions. That piece seems to be missing.
The dataverse-docker repo comes with a bunch of docker-compose recipes. We currently use just the default.
If you run this locally and need to start over for some reason, note that there are some persistent files that need to be wiped out, too. They are generated during the setup described earlier. A git status or datalad status should show you the .env file and a minio-data/ directory as untracked. However, there's one more: database-data/. That one is readable only by root, hence a regular user's git status doesn't show it. A sudo git status would confirm that this is untracked, too.
If you need to wipe out your local instance and you don't have any other docker containers, then this command may be useful for you: for i in $(docker ps -qa); do docker stop $i && docker rm $i; done. It stops all containers and removes them, so you get back to where you started. If you only want currently running containers to be affected, leave out the a option to docker ps (that is, use docker ps -q). If you have other containers running, I assume you know how to deal with that anyway. Don't forget to rm everything untracked in dataverse-docker. If you followed these instructions, this would be .env and the two directories minio-data and database-data. The latter is only readable by root, so you'll need sudo for removing it as well as to see it reported as untracked by git status.
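
Before starting over, it may help to check which of the residual files named in the notes above are actually present. The following sketch only lists them and deliberately removes nothing, since database-data needs root permissions anyway. The helper name find_residuals is hypothetical.

```python
from pathlib import Path

# Residual files and directories named in the notes above
RESIDUALS = (".env", "minio-data", "database-data")


def find_residuals(repo_root):
    """Return the residual paths that currently exist under `repo_root`."""
    root = Path(repo_root)
    return [root / name for name in RESIDUALS if (root / name).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the root of the dataverse-docker clone.
    for path in find_residuals("."):
        print("left over:", path)
```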

The CI setup results in two users and a root dataverse being created: a superuser 'testadmin' and a regular user 'user1'. Their API tokens are accessible to the tests via the environment variables TESTS_TOKEN_TESTADMIN and TESTS_TOKEN_USER1, respectively. If you want to see how that is done so that you can reproduce it locally, check setup_docker_dataverse.sh for how it calls docker cp to copy several JSON files and init_dataverse.sh into the container and then executes the latter. Both scripts and the JSON files are here in the repository under tools/ci.
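
A test could pick up these tokens roughly as sketched below. This is not taken from the project's test suite. The helper get_test_token and the mapping TOKEN_VARIABLES are assumptions for illustration; only the two environment variable names come from the description above.

```python
import os

# Environment variables set by the CI setup, keyed by user name
TOKEN_VARIABLES = {
    "testadmin": "TESTS_TOKEN_TESTADMIN",
    "user1": "TESTS_TOKEN_USER1",
}


def get_test_token(user, environ=None):
    """Return the API token for `user`, or None if it is unset or empty."""
    environ = os.environ if environ is None else environ
    return environ.get(TOKEN_VARIABLES[user]) or None


if __name__ == "__main__":
    for user in TOKEN_VARIABLES:
        state = "available" if get_test_token(user) else "missing"
        print(user, "token is", state)
```

Returning None for a missing token makes it easy for a test to skip itself when no Dataverse instance has been set up locally.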
