datalad-handbook / course

Talks and materials for workshops based on the DataLad handbook

License: Other

course's Issues

1.5 day workshop in Lucca

@mih and I will be giving a workshop on DataLad in Lucca on March 23rd-24th. This issue lists the TODOs and acts as a progress tracker.
Please extend and edit as necessary. :)

Logistics

await Feedback from Lucca on dates
await Feedback from Lucca on GDrive account
figure out travel
- ~~@adswa (I will likely take a train. Depending on when we plan to arrive, there is a nice one overnight, arriving at 7 something in the morning)~~ EDIT: both of us will go to Pisa from Montreal
- @mih

Software

write a custom wrapper around a special remote for gdrive.
- Figure out which software to base it on. Rclone seems to work, but there also seems to be git-annex-remote-googledrive, listed under "gitannex/tips", and directly linked as a specialized service.

Teaching

A Basics layout has been proposed by @mih and awaits feedback from Lucca

Datalad concepts and principles
Basics of local data/code version control
- Hands on: tasks to exercise basic building blocks
Modular data management for reproducible science
- Hands on: implement sketch of a reproducible paper
Data management for collaborative science
- Hands on: Using your infrastructure (Gdrive) to collaborate on a
  demo project
Data publication
- Hands on: Publish data on "GitHub"
Outlook (what else is possible, resources, use cases)
Potential group work: Small sets of people are given problems to solve with DataLad and present

This is currently structured like this:
Monday 23 Morning session
1 Datalad concepts and principles
2 Basics of local data/code version control + Hands on: tasks to exercise basic building blocks

Monday 23 Afternoon session
1 Modular data management for reproducible science + Hands on: implement sketch of a reproducible paper
2 Data management for collaborative science + Hands on: Using your infrastructure (Gdrive) to collaborate on a demo project

Tuesday 24 Morning session
1 Data publication + Hands on: Publish data on "GitHub"
2 Outlook (what else is possible, resources, use cases)

Resources to create

rclone GDrive wrapper (started here datalad/datalad#4162)
slides
code lists
sketches of a LaTeX (?) skeleton for a reproducible paper. @adswa could potentially use resources she will help to improve at the Turing Way book dash.
Data to use for examples and to publish to Gdrive
Optional/Wishlist: Some sort of audience response system (EduVote, browser-based, Google Forms, ...?), e.g., in the form of: "How confident are you using ...?" --> rating scale
Workshop feedback (potentially pre-post, to learn about attendees' expectations before and after the course, and their knowledge gain). Also remember to collect feedback on DataLad.

public url for pics/slides?

I got interested in sandwhich03.svg, which comes from the pics/slides submodule, but that submodule only has an SSH URL:

(git)lena:~datalad/datalad-handbook/course[master]git
$> datalad -f json_pp subdatasets pics/slides
{
  "action": "subdataset",
  "gitmodule_name": "pics/slides",
  "gitmodule_url": "kumo.ovgu.de:/home/mih/public_html/datalad/slides",
  "gitshasum": "76882e01a9194444b507491889e7d9f6d6dcb6b2",
  "parentds": "/home/yoh/proj/datalad/datalad-handbook/course",
  "path": "/home/yoh/proj/datalad/datalad-handbook/course/pics/slides",
  "refds": "/home/yoh/proj/datalad/datalad-handbook/course",
  "state": "absent",
  "status": "ok",
  "type": "dataset"
}
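
To find every subdataset affected by this, the JSON records can be scanned for URLs that require SSH access. This is a minimal sketch, assuming records shaped like the output above. The helpers `is_ssh_url` and `ssh_only_subdatasets` are hypothetical and not part of DataLad, and the URL check is a heuristic rather than a full Git URL parser.

```python
import json
import re

# SCP-like Git syntax "[user@]host:path" (no scheme), which implies SSH.
_SCP_LIKE = re.compile(r"^(?:[^@/:]+@)?[^/:]+:")


def is_ssh_url(url: str) -> bool:
    """Heuristically decide whether a Git URL needs SSH access."""
    if url.startswith(("ssh://", "git+ssh://")):
        return True
    if "://" in url:
        return False  # some other scheme, e.g. https://
    if url.startswith(("/", "./", "../")):
        return False  # local path
    return bool(_SCP_LIKE.match(url))


def ssh_only_subdatasets(records):
    """Yield (gitmodule_name, gitmodule_url) for subdatasets with SSH URLs."""
    for rec in records:
        url = rec.get("gitmodule_url", "")
        if rec.get("type") == "dataset" and is_ssh_url(url):
            yield rec["gitmodule_name"], url


# Abbreviated copy of the record reported above.
record = json.loads("""
{
  "action": "subdataset",
  "gitmodule_name": "pics/slides",
  "gitmodule_url": "kumo.ovgu.de:/home/mih/public_html/datalad/slides",
  "state": "absent",
  "status": "ok",
  "type": "dataset"
}
""")
flagged = list(ssh_only_subdatasets([record]))
print(flagged)
```

Feeding it one record per subdataset would give a list of submodules that outside users cannot clone without an account on the host.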

IRTG Workshop Aachen

When: November 26th, 2019, 4pm
Where: Same library seminar room as before
Duration: 2 hours
Participants: 25 grad students, various backgrounds (neuroscience, psych, bio, physics, engineering, medicine), workshop will be made compulsory

Communicated expectations on content:

DataLad
BIDS

TODO

Dienstreiseantrag
Short description/overview to distribute in advance
Slides/casts
Code/materials for participants

Own thoughts

The time is extremely limited: The workshop needs to get them motivated to learn the tools (e.g., start with a reproducible paper teaser, and for BIDS maybe show brainlife.io), give a brief introduction to the basic principles (probably Dataset basics and a shortened Reproducible execution session), and beyond that contain pointers to everything that is relevant for subsequent self-study.
Based on the conversation with Julia and HanGue, students don't seem to know about version control/Git, BIDS or any standard structure. Teaching them the very basics alone will already make a large difference to their workflows.
possibly: collate a sheet with a collection of useful links.

ABCD-ReproNim Course

Date: Jan 22nd 2020
Tentative schedule:

ReproNim: Data Versioning and Transformation with DataLad
Instructor: Adina Wagner*, Institute of Neuroscience and Medicine (INM-7)
Why Should Data Be Versioned?
Simple DataLad Transform: Retrieve, Compute, Store Results
Create a Dataset
Using DataLad with Containers on the Dataset
Rerunning and Checking Analysis Differences

Submission due: Dec. 15th

Todos:

Pre-record your lecture (details to be provided separately) by September 15th/December 15th (depending on your 'session'; see syllabus);
Be available for your 1-hour question and answer period with the students on the Friday at 1pm EST/10am PST as indicated in the syllabus;
Provide 1-2 readings/watchings (~30 minutes) you would like to assign prior to your lecture;
Review the homework assignment generated by the TA team before distribution to the students;
(Optional) Attend the "virtual" workshop March 8-12, 2021.

Useful free tool for simple audience polling: https://www.directpoll.com

This tool is very useful:

create questions in advance (expires after 30 days unless you "save" it again)
embed the live results into the presentation (using an <iframe></iframe> tag):

    <iframe src="https://directpoll.com/r?XDbzPBdJ2bAX0ZEC2YlWLumm6WtYBkChGSFh5Vwe4W"
    title="This is my poll" width="900" height="900"></iframe>
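
When a slide deck embeds several polls, generating the tag can avoid hand-editing mistakes in the attribute list. This is a minimal sketch. `poll_iframe` is a hypothetical helper, and the default size simply mirrors the values in the snippet above.

```python
from html import escape


def poll_iframe(src: str, title: str, width: int = 900, height: int = 900) -> str:
    """Build an <iframe> tag with space-separated, escaped attributes."""
    attrs = {
        "src": src,
        "title": title,
        "width": str(width),
        "height": str(height),
    }
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<iframe {rendered}></iframe>"


tag = poll_iframe(
    "https://directpoll.com/r?XDbzPBdJ2bAX0ZEC2YlWLumm6WtYBkChGSFh5Vwe4W",
    "This is my poll",
)
print(tag)
```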

Cast_live should log into brainbfast and execute commands there...

... instead of executing everything on my machine.

Lessons from datamanagement support sessions based on the book

It would be useful to have an interactive run session (e.g., datalad run nano).
Building up the command by trial and error as in the book doesn't work as well in a workshop session: it is hard to motivate why we run into all of these errors, and easy to lose track of what it is we're trying to achieve

DebConf Talk on DataLad, due August 15th

The DebConf talk proposal was accepted.
Here is the abstract:

Title: DataLad - Decentralized Management of Digital Objects for Open Science

With a general awareness of a reproducibility crisis in many scientific areas and increasing importance of research data management in science and policy making, data-driven fields require convenient and scalable data management solutions. Standing on the shoulders of Git and git-annex (git-annex.branchable.com/, Joey Hess), DataLad provides a decentralized solution that enables the joint management of code, data, and complete containerized computational environments in a scalable and distributed fashion. With features such as unambiguous version control, a wide spectrum of data transport mechanisms, convenient provenance capture, and re-execution for verification or as an alternative to storage and transport, it enables and facilitates many aspects of open and reproducible science: collaboration, sharing, analytical transparency, computational reproducibility of digital research objects, and disk-space aware storage and computing workflows on infrastructure that ranges from personal laptops up to supercomputers.

In this talk, we will introduce DataLad, present its main features which should be of interest to the audience regardless of their relation to any field of science, and share the process and status of its adoption in the neuroimaging community.

Recording tips: https://debconf-video-team.pages.debian.net/docs/advice_for_recording.html

Book vs course

The goal is to develop a course based on the book while minimizing the amount of disconnected material, thereby making it easier to evolve book and course together with the evolution of datalad

the course and the book share the exact same content, but the former is performed, while the latter serves as the syllabus
code examples in the book are actually executable. we use this feature to turn them into "cast" scripts. once in that form, we can use the cast_live tools from DataLad to demo them in a course installment
each code example in the book needs to be equipped with a "caption" that can then serve as a narrative cue in the cast script. The caption could then also be displayed in the book itself.
each code example in the book needs to get a tag or label that can be used to subselect examples that make up a shorter, but still internally consistent narrative -- this aids the generation of shorter course installments
initially the slides of the course material are based on the "summary" components of each chapter, plus relevant key figures. once tailored to and validated by teaching the course, their content is fed back into the book (possibly using a new dedicated markup). Each slide contains a link to the respective part of the book, where more details are available. The link is possibly implemented as a QR code.
the order of topics in the course matches the order in the book. if it turns out that this order is suboptimal it needs to be adjusted in both book and course. consequently, the course starts with basics and a uniform narrative, and ends with more standalone scenario descriptions.
the course starts with, or follows, a "pitch" that outlines an attractive take-away for the respective target audience. Candidate pitches are any "use case" chapter.
slide decks for course installments are based on reveal.js, and are more or less fully generated using the book sources and a (set of) templates. Each chapter has its own slide deck.
analogous to the book, each session/chapter (and in particular the early ones) must communicate in a self-evident fashion why its content/objective is important and applicable to practical problems a target audience can relate to.
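
The caption and tag ideas from the list above can be sketched as follows. This is only an illustration under stated assumptions. The `Example` record, the tag names `short` and `full`, and the helper `to_cast_script` are hypothetical, and the `say`/`run` line format is assumed to match what the cast_live tools expect.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class Example:
    caption: str  # narrative cue shown before the command
    command: str  # executable code taken from the book
    tags: set = field(default_factory=set)  # labels used for subselection


def to_cast_script(examples, tag):
    """Render all examples carrying `tag` as cast lines, keeping book order."""
    lines = []
    for ex in examples:
        if tag in ex.tags:
            lines.append(f"say {shlex.quote(ex.caption)}")
            lines.append(f"run {shlex.quote(ex.command)}")
    return "\n".join(lines)


examples = [
    Example("Create a new dataset", "datalad create DataLad-101", {"short", "full"}),
    Example("Check the state of the dataset", "datalad status", {"full"}),
]
short_script = to_cast_script(examples, "short")
full_script = to_cast_script(examples, "full")
print(short_script)
```

Selecting by tag while preserving the book's order is what would keep a shorter installment internally consistent with the full narrative.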

Content (based on current book)

Setup: Git ID, installation, what is a terminal
Datasets (create, save, install, nesting): basic local version control, manual log keeping
Run: basic provenance tracking, automatic log keeping
Git-annex basics: disaster recovery (needs merge of currently disjoint chapters git-annex and help yourself)
Collaboration: yes!
YODA: using the conceptual pieces optimally for maximum practical benefits -- this will be and is a mostly conceptual part

Each of these "basics" chapters is handled in a 90min installment.

After the initial sessions on "basics", a number of use case descriptions can follow.

For the initial run at INM7, we will have a dedicated "How to work with the local infrastructure" session that could take place any time after (3). This will then also turn into a use case chapter in the book.

Instead of a weekly or biweekly frequency, this course can also be taught as a 2-day block event, with the basics on day 1, and a re-cap + use cases on a (shorter) day 2.

HCP data related course for the MPI CBS

Some time in September,
remote
30min concepts, 30-90min hands-on.
centered around how to get and analyze HCP data.

This will be cool!

MPI Workshop in November

Registering as a talk/workshop todo. Info is in https://github.com/adswa/mpi-datamanagement-ws/.
Takes place November 18th, full day.

Educating for a FAIR future talk at the NWG, due Feb 22nd

10 minute video, prerecorded - young investigator presentation
live discussion virtually, March 28th, evening

abstract:
With a growing awareness of the role of sample size and replicable results (Button et al., 2013; Turner et al., 2018), a rise of platforms, tools, and standards that aim to facilitate data sharing and management (Wiener et al., 2016), unprecedented sample sizes (e.g., UKBiobank; Bzdok & Yeo, 2017), and increasingly complex data analyses (e.g., Glasser et al., 2013; Alfaro-Almagro et al., 2018), research data management (RDM) is essential to put open and FAIR neuroimaging research into effect. But just as FAIRness and RDM cannot be an afterthought in any given scientific project, they also shouldn’t be an afterthought in the training and education of current and future generations of neuroscientists. This training has to fulfill the demands of different stakeholders in science: 1) Researchers, that apply RDM in their scientific projects, 2) PIs and similar personnel with management tasks, that need to set out and justify plans for the implementation of RDM and FAIR principles, and 3) trainers, such as librarians or data managers, that educate users on tools and practices for FAIR science (Fothergill et al., 2019, Grisham et al., 2016). Researchers of any career level and of any background need accessible tutorial-like educational content and documentation for relevant tools and concepts to apply FAIR RDM from the get go. Planners need high-level, non-technical information in order to make informed yet efficient decisions on whether a tool fulfils their needs. And trainers need reliable, open teaching material.
A user-driven alternative to scientific software documentation by software developers, “Documentation Crowdsourcing”, has been successfully employed by the NumPy project (Oliphant, 2006; Pawlik et al., 2015). Extending this concept beyond documentation, we have created the DataLad handbook (handbook.datalad.org) as a free & open-source, user-driven and -focused educational instrument and resource for trainers, users, and planners for (research) data management, independent of their background and skill level (Wagner et al., 2020). Drawing from the experiences of creating more than 400 pages of educational material, with almost 40 independent contributors from around the world, and nearly 2 years of in-person and virtual teaching based on the handbook, I want to highlight the unique challenges of RDM training as well as its opportunities for the field of neuroscience.

some figures used in the talks are missing

just was trying to get a glimpse of https://github.com/datalad-handbook/course/blob/0b26cb6ac9a5d6c2d5bd5473a92d0284d959ec79/talks/hhu.html but it seems that most of the figures, such as e.g. talks/hhu.html: <img height="850" class="fragment fade-in" src="../pics/ukb_datasets.svg"> are nowhere to be found.
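
A small check can list every figure a talk references but that is not present on disk. This is a minimal sketch using only the Python standard library. `ImgSrcCollector` and `missing_images` are hypothetical helpers, and the file `present.svg` in the demo is made up for illustration.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def missing_images(html_file: Path):
    """Return relative image paths referenced by html_file that do not exist."""
    parser = ImgSrcCollector()
    parser.feed(html_file.read_text(encoding="utf-8"))
    missing = []
    for src in parser.sources:
        if "://" in src:
            continue  # remote image, not checked here
        if not (html_file.parent / src).exists():
            missing.append(src)
    return missing


# Demo in a throwaway directory that mimics the talks/ and pics/ layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "talks").mkdir()
    (root / "pics").mkdir()
    (root / "pics" / "present.svg").write_text("<svg/>")
    page = root / "talks" / "hhu.html"
    page.write_text(
        '<img src="../pics/present.svg">'
        '<img height="850" class="fragment fade-in" src="../pics/ukb_datasets.svg">'
    )
    result = missing_images(page)
print(result)
```

Because `Path.exists()` follows symlinks, a figure that is only a dangling link (for example, an annexed file whose content has not been retrieved) is reported as missing too.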

Handbook2livecasts: Todos for cast_live and automatically created casts

This is to document how to turn the handbook into cast_live scripts.

Create a cast with annotated code snippets in the handbook (see datalad-handbook/book#217 for insights on how to do this)
Use a custom version of DataLad's cast_live to "play" it

TODO:

update the cast_live tools to run without obscure failure (XGetWindowProperty[_NET_WM_DESKTOP] failed (code=1))
- the command that fails is xdotool windowactivate --sync $(xdotool getwindowfocus)
create a copy of appropriately customized cast live tools in this repo
add the casts (as soon as they are created)

15 Min talk in Oldenburg, Nov. 2nd and 3rd

For a symposium "Open and Reproducible Neuroimaging: Integration of community developed tools from data acquisition to publication". Michael and I will both have a 15 min slot to talk about data storage and retrieval.