
datalad's Introduction

 ____            _             _                   _ 
|  _ \    __ _  | |_    __ _  | |       __ _    __| |
| | | |  / _` | | __|  / _` | | |      / _` |  / _` |
| |_| | | (_| | | |_  | (_| | | |___  | (_| | | (_| |
|____/   \__,_|  \__|  \__,_| |_____|  \__,_|  \__,_|
                                              Read me


Distribution

Packages are available for Anaconda, Arch Linux (AUR), Debian stable and unstable, Fedora Rawhide, and Gentoo (::science), as well as from PyPI.

10000-ft. overview

DataLad makes data management and data distribution more accessible. To do that, it stands on the shoulders of Git and git-annex to deliver a decentralized system for data exchange. This includes automatically ingesting data from online portals and exposing it in readily usable form as Git(-annex) repositories, so-called datasets. The actual data storage and permission management, however, remains with the original data providers.

The full documentation is available at http://docs.datalad.org, and http://handbook.datalad.org provides a hands-on crash course on DataLad.

Extensions

A number of extensions are available that provide additional functionality for DataLad. Extensions are separate packages that are installed in addition to DataLad. To install DataLad customized for a particular domain, one can simply install the relevant extension directly, and DataLad itself will be installed automatically with it. An annotated list of extensions is available in the DataLad handbook.
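For example, installing an extension straight from PyPI (here datalad-container, picked purely for illustration) pulls in the datalad package itself as a dependency:

# installing an extension installs DataLad automatically as a dependency
pip install datalad-container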

Support

The documentation for this project is found here: http://docs.datalad.org

All bugs, concerns, and enhancement requests for this software can be submitted here: https://github.com/datalad/datalad/issues

If you have a problem or would like to ask a question about how to use DataLad, please submit a question to NeuroStars.org with a datalad tag. NeuroStars.org is a platform similar to StackOverflow but dedicated to neuroinformatics.

All previous DataLad questions are available here: http://neurostars.org/tags/datalad/

Installation

Debian-based systems

On Debian-based systems, we recommend enabling NeuroDebian, via which we provide recent releases of DataLad. Once enabled, just do:

apt-get install datalad
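If NeuroDebian is not yet enabled, the following is a minimal sketch only -- the exact sources list name and key import procedure for your release and mirror are documented at http://neuro.debian.net (the bullseye/us-nh values below are assumptions):

# add the NeuroDebian repository (adjust release and mirror as per neuro.debian.net)
wget -O- http://neuro.debian.net/lists/bullseye.us-nh.full \
  | sudo tee /etc/apt/sources.list.d/neurodebian.sources.list
# import the NeuroDebian archive key, then refresh the package index
sudo apt-key adv --recv-keys --keyserver hkps://keyserver.ubuntu.com 0xA5D32F012649A5A9
sudo apt-get update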

Gentoo-based systems

On Gentoo-based systems (i.e. all systems whose package manager can parse ebuilds as per the Package Manager Specification), we recommend enabling the ::science overlay, via which we provide recent releases of DataLad. Once enabled, just run:

emerge datalad
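If the overlay is not yet enabled, one hedged sketch uses eselect-repository (assuming app-eselect/eselect-repository is installed; layman or a manual repos.conf entry works as well):

# enable and sync the ::science overlay before emerging datalad
eselect repository enable science
emaint sync -r science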

Other Linux'es via conda

conda install -c conda-forge datalad

will install the most recently released version, and release candidates are available via

conda install -c conda-forge/label/rc datalad

Other Linux'es, macOS via pip

Before you install this package, please make sure that you install a recent version of git-annex. Afterwards, install the latest version of datalad from PyPI. It is recommended to use a dedicated virtualenv:

# Create and enter a new virtual environment (optional)
virtualenv --python=python3 ~/env/datalad
. ~/env/datalad/bin/activate

# Install from PyPI
pip install datalad

By default, installation via pip installs the core functionality of DataLad, allowing for managing datasets etc. Additional installation schemes are available, so you can request enhanced installation via pip install datalad[SCHEME], where SCHEME could be:

  • tests to also install dependencies used by DataLad's battery of unit tests
  • full to install all dependencies.
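For example (quoting protects the brackets from shell globbing in zsh and friends):

# core plus all optional dependencies
pip install 'datalad[full]'
# core plus only the dependencies needed to run the test battery
pip install 'datalad[tests]'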

More details on installation and initial configuration can be found in the DataLad Handbook: Installation.

License

MIT/Expat

Contributing

See CONTRIBUTING.md if you are interested in internals or contributing to the project.

Acknowledgements

The DataLad project received support through the following grants:

  • US-German collaboration in computational neuroscience (CRCNS) project "DataGit: converging catalogues, warehouses, and deployment logistics into a federated 'data distribution'" (Halchenko/Hanke), co-funded by the US National Science Foundation (NSF 1429999) and the German Federal Ministry of Education and Research (BMBF 01GQ1411).

  • CRCNS US-German Data Sharing "DataLad - a decentralized system for integrated discovery, management, and publication of digital objects of science" (Halchenko/Pestilli/Hanke), co-funded by the US National Science Foundation (NSF 1912266) and the German Federal Ministry of Education and Research (BMBF 01GQ1905).

  • Helmholtz Research Center Jülich, FDM challenge 2022

  • German federal state of Saxony-Anhalt and the European Regional Development Fund (ERDF), Project: Center for Behavioral Brain Sciences, Imaging Platform

  • ReproNim project (NIH 1P41EB019936-01A1).

  • Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant SFB 1451 (431549029, INF project)

  • European Union’s Horizon 2020 research and innovation programme under grant agreements:

A Mac mini instance for development is provided by MacStadium.

Contributors ✨

Thanks goes to these wonderful people (emoji key):

All of the following contributed code (💻): glalteva, adswa, chrhaeusler, soichih, mvdoc, mih, yarikoptic, loj, feilong, jhpoelen, andycon, nicholsn, adelavega, kskyten, TheChymera, effigies, jgors, debanjum, nellh, emdupre, aqw, vsoch, kyleam, driusan, overlake333, akeshavan, jwodder, bpoldrack, yetanothertestuser, Christian Mönch, Matt Cieslak, Mika Pflüger, Robin Schneider, Sin Kim, Michael Burgardt, Remi Gau, Michał Szczepanik, Basile, Taylor Olson, James Kent, xgui3783, tstoeter, Stephan Heunis, Matt McCormick, Vicky C Lau, Chris Lamb, Austin Macdonald, Yann Büchau, Matthias Riße, Aksoo



datalad's Issues

Redefine prototype.

Not really an issue, just a comment on your top-level readme:

It is currently in a "prototype" state, i.e. a mess.

that's classic!

CRCNS.org support

During SfN, Jeff Teeters shared good news on the availability of alternative methods for downloading datasets from crcns.org, which should support versioning etc.; research this in more detail:
http://crcns.org/download

expose "S3 client" capabilities to custom special remotes

A really wild question: to provide access to S3 buckets that are not exposed publicly (thus requiring authentication via IAM credentials, such as HCP) and through prefixes/revision IDs, instead of just to a pure keystore (as in the native git-annex S3 special remote), we would need to provide an external special remote which would again need to implement S3 client capabilities for basic authentication/fetching. It can be done, but we would grow dependencies implementing the same functionality git-annex already has inside.
So a wild thought came: maybe git-annex could somehow expose its S3 functionality to e.g. external special remotes through some API? ;) It could well be made into something even more generic. git-annex already has capabilities for talking to a plethora of data hosting providers and thus could maybe serve as a glorified "ultimate downloader"? Or, say, expose itself as yet another special remote capable of TRANSFER?

More fine-grained access permissions

Per discussion with @nicholsn -- some use cases might require more fine-grained permissions for access to sub-portions of the dataset (e.g. just behavioral data, or just anatomicals). This most probably could be achieved with modularization going beyond the subject level (thus #40 was mentioned).

publish for backup

One of the problems many labs might have is limited, if any, offsite backup of their data. Given large amounts of processed etc. data, it might be worth having a mode where we "publish" repositories to e.g. an external hard drive, copying only "precious" data (e.g. original raw/preprocessed). git-annex has the notion of "preferred content", but it is assigned per "source" repository, whereas we would need something like "preferred content to publish to X", or just a way to declare some files/directories "precious" (maybe just a tag). Then it would be useful to allow incremental updates of backups on the external drive by stating e.g. "publish --to=/media/drive --tags=precious".

@joeyh - what would be your thoughts on how to wrap it up? tags?
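A rough sketch of how far plain git-annex gets today, using metadata tags plus preferred content (remote name, paths, and tag are made up for illustration; whether this is convenient enough is exactly what this issue is about):

# one-time: put a clone on the external drive and register it as a remote
git clone . /media/drive/backup
(cd /media/drive/backup && git annex init "drive backup")
git remote add drive /media/drive/backup

# mark precious content and declare what the drive wants
git annex metadata --tag precious sourcedata/
git annex wanted drive "metadata=tag=precious"

# incremental backup: copy only the content the drive wants
git annex copy --to=drive --auto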

Allow for "lean" annex views with only files having a content

NB: I thought I had filed this here, but it must have been elsewhere, so a backlink will be provided later.

git annex view provides great functionality to take advantage of meta-data (tags) associated with data files for custom views. One of the stumbling points might often be "installation" of large datasets where only a handful of files are actually needed/used. ATM it results in a directory hierarchy where possibly the majority of files are broken links, which makes navigation difficult and non-productive.

It would be great if there was a way to generate a "lean" view where only files with content available are visible.

Somewhat related would be #6, i.e. to carry an update while maintaining a lean view
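For context, a minimal view invocation (the metadata field name is hypothetical); what is missing is a switch that restricts such a view to files whose content is present locally:

# re-expose files grouped by the value of a (hypothetical) "modality" metadata field
git annex view "modality=*"
# return to the original branch and layout
git annex vpop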

git-annex add (git ls-files) etc performance on laaarge datasets

Just a topic for possible discussion etc

Just for the sake of seeing how big a really big dataset would get, I am "simulating" an annex repository with all the files distributed by the HCP 500-subjects release, which, as deposited to S3, has >5,600,000 files. It seems to be running for a while now ;)
Some of the initial bottlenecks I have detected:

  • git-annex add invokes 'git ls-files --others --exclude-standard -z', which is apparently CPU-hungry (not bound by IO) and does quite a bit of exploration (e.g. for .git in every subdirectory). It might be worth profiling and getting some stats on what it spends time on, or whether 'annex add' could have an option '--initial' or '--plain-ls' to avoid using ls-files (but that might have side effects etc.)
  • further in the process, when files actually get annexed, git-annex itself uses some locking mechanism, so it is quite busy opening/closing (?, just looking at fds) journal.lck. I wonder if the process of 'annex add'ing could somehow be parallelized, if not made more efficient, for such obnoxious cases

The script I am using is here: https://github.com/datalad/datalad/blob/master/tools/mimic_repo (really not sure why I did it in Python instead of a simple bash script). And here are the corresponding 's3cmd ls' outputs I am running on:
http://www.onerussian.com/tmp/hcp-ls.20141020-500subject-only.txt.gz
or only just a 1000 (to give a try)
http://www.onerussian.com/tmp/hcp-ls.20141020-500subject-only1000.txt.gz

so you could run e.g. with

./mimic_repo hcp-ls.20141020-500subject-only1000.txt.gz /tmp/testtt

consider making output of 'git-annex get' less verbose

ATM it is just the output of wget, so the screen gets filled up with all kinds of stuff quite quickly, sometimes ruining precious terminal history. In the light of #10 it might be worth considering making default fetching of data less verbose, with more or less an 'Ok' or 'Fail' status being reported. Also, I somewhat like how docker manages to report its progress in the terminal (curses?), where multiple lines show progress and not that much of the terminal screen is wasted.

operation to "get" content for files which previously had content fetched

Imagine an annex repository which had some files' content fetched (via 'annex get'). Then modifications to some of those, and maybe other, files were done in the original repository, so if a plain "git pull" is performed, symlinks to the new content would be broken and would require a manual "git annex get" on those files which previously were "get"ed at some point in the past. AFAIK (and from our discussion on IRC, cited below; thanks patagonicus and scalability-junk for the discussion) there is currently no way to achieve that without an additional script on top, in two possible ways:

  1. collecting the list of files with local content before the "pull" and then "get"ing them after
  2. retrospection of git/git-annex history for each file without content, checking whether it was previously obtained and its content key was not explicitly dropped later on.

FWIW -- here is a transcript of the IRC session

10:05   yoh: is there a way to 'upgrade' the content which was 'got' already?  e.g. I have some files which got their 
             content delivered to local annex.  then I want to 'git pull' + 'git annex get 
             those_files_for_which_I_had_content'
10:07   patagonicus: yoh: Probably git annex merge, if pull already got all the branches it needs. No need to do an 
                     extra get, if the data was already copy --to the repo (merge will update your master branch so 
                     that the new file's symlinks appear and the link target will already be there)
10:08   warp: just do git annex sync?
10:09   scalability-junk: yeah not sure why someone would manually git pull and push with git annex sync.
10:09   bremner: sync either gets all content or none
10:09   scalability-junk: bremner: nope it gets all metadata aka branches and git stuff
10:09   scalability-junk: content is only synced with git annex sync --content
10:09   bremner: yes, and then it syncs all content, like I just said
10:09   scalability-junk: and even then it's not all content, but the content which is preferrer or has not 
                          enough duplicates
10:10   patagonicus: bremner: No, it's based on preferred content
10:10   scalability-junk: *preferred
10:10   bremner: ok, fine. It _still_ doesn't answer yoh's question.
10:10   patagonicus: But sync does push/pull all branches, so if you want to only sync some. And if you want 
                     to rewrite history you'll probably have to fetch/push --force and do some resets.
10:11   scalability-junk: rewriting history is a pain
10:11   yoh: keep also in mind that sync tries to be bidirectional -- I want a clean one-way.
10:13   scalability-junk: Wasn't said. Alright so yeah git merge probably
10:18   yoh: checked -- nope -- merge didn't get it and there is no --content for it
10:19   patagonicus: yoh: Are the symlinks there? And you already pushed that file with git annex copy --to 
                     from a different repo?
10:19   yoh: I am just trying with two local repos I made for testing... there is no need to copy --to
10:20   patagonicus: Eh … then how was the content "'got' already"?
10:20   yoh: ok - 1 sec
10:23   yoh: eh -- history a bit messy to share... s215bLgyFs
10:23   yoh: http://slexy.org/view/s215bLgyFs
10:23   yoh: so this way ;)  after initial clone I 'get' some interesting file.  then if they get modified 
             in origin, I would like to "update" them locally as well, but only them
10:24   yoh: merge itself doesn't even pull... may be I should fetch before and then merge,... let's see
10:24   patagonicus: Yeah, merge does not do any fetch/pull/push. It just uses the synced/* branches that 
                     are available to update master and git-annex
10:25   patagonicus: You have no line that would transfer the "123" content from d1 to d2, so it's not there.
10:26   patagonicus: You'll have to run a git annex get afterwards. Or add d2 as a remote to d1 and run git 
                     annex copy --to=d2 at any time.
10:26   yoh: I understand that... and that is what I would like to achieve -- that some command does that 
             content transfer for those files which had local content before
10:26   patagonicus: So, basically git annex sync --content without pushing to the remote?
10:26   yoh: sync  would sync/get all the datafiles
10:27   yoh: I would like only those present on client already (server shouldn't be aware of the client, so 
             no copy --to)
10:28   patagonicus: What do you mean with present? Based on the file names or on the content? Because in 
                     your example you have 1 file but two contents (so two different keys in annex' storage. 
                     One content will be present, one (the 123 one) will not).
10:29   yoh: on the file names
10:30   patagonicus: Oh. That's going to be complicated, I think. So if there's two files, A and B, and the 
                     client has the content of A and both A and B get changed on the server, only the new 
                     content for A should be made available on the client?
10:30   yoh: yes
10:31   yoh: I guess could also be done via retrospection in git/git-annex
10:33   patagonicus: There's no such feature built into annex at the moment (and I'm not sure if it ever 
                     will be added). Basically what you want to do is: run git annex find (which will list 
                     all files currently in the repo, probably with --print0), run git annex merge to update 
                     master, the run git annex get with the result of the find you previously did. That 
                     wouldn't catch file renaming, but for the basic case
10:34   patagonicus: it should work. Needs N*M space, though, where N is the number of files in the local 
                     repo and M is the average file name length.
10:34   patagonicus: I think git annex find --print0 >here && git annex fetch -a && git annex merge && xargs 
                     -0 git annex get <here or something like that should work.
10:35   patagonicus: Will break if there's ever a merge in which file content was changed that is done 
                     without the find/get pair.
10:35   yoh: yeap, something like that
10:36   yoh: because of such corner-cases I think it might be better to do it (optionally?) via full 
             retrospection -- if each file without content ever had content before and was not explicitly 
             dropped
10:37   patagonicus: Then the runtime would depend on the number of commits on master (times the number of files).
10:37   patagonicus: Should be doable with a bit of scripting, though.
10:38   patagonicus: The "explicetly" dropped would be harder. Basically you could only find out if any of 
                     the previous versions of a file is still in the local repo. However, if you keep 
                     running git annex unused && git annex dropunused all those would never be in the local 
                     repo.
10:39   yoh: patagonicus: yes, I know.  But such full retrospection could then be optional for explicit run 
             and/or called if manual operation was detected... may be there could even be some record of a 
             last state when things were "in order" to start from that point
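Condensing the workaround suggested at the end of the transcript into a runnable sketch (using a plain git fetch for the one-way update; corner cases like renames are not handled):

# remember which files currently have content locally
git annex find --print0 > /tmp/had-content
# bring in the new history without pushing anything back
git fetch origin
git annex merge
# re-fetch content only for the files that previously had it
xargs -0 git annex get < /tmp/had-content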

Composite (git submodules?) dataset handles with support for the views

As HCP and NCANDA (via @nicholsn, with 10 million files) show, it might be infeasible to aim at shipping the entire beast in a single git/git-annex repository, which suggests modularizing at least at the level of a single subject (or maybe subject/modality or subject/visit), to be managed with e.g. git submodules or some "native" datalad distribution mechanism. But then we do not want to lose features such as git-annex "views", since they provide really nice functionality to re-expose the data in an analysis-specific, convenient layout (given that the meta-data is contained within annex). So this all requires additional thought on how to go about it.

@joeyh -- any ideas on how to go about growing large repositories partitioning/handling?

aspera support

Aspera is a company (http://asperasoft.com, apparently belonging to IBM now) developing products for high-performance transfer -- it intends to fill up the pipe even while going through the WAN. Currently used by NIH (for NCBI) and HCP.

Pros:

  • efficient transfer

Cons:

  • closed source proprietary patented technology
  • even clients are closed source
  • freely available client might be difficult to "drive" due to a rich variety of settings/options etc

Theoretically nothing forbids supporting it, but logistically it might be very messy (see cons).

with_tempfile needs to be chainable

probably by taking **kwargs and pop'ing only a limited set of arguments, passing *args and the rest of **kwargs deeper inside

test also by nesting multiple with_tempfile's

Facilitate space-efficient throw-away clones of dataset handles

If I have a big dataset and I want to do a number of analyses with multiple users in a shared computing environment, it would be nice to be able to do that in a way that has minimal impact on storage demands.

In my concrete case, I have about 100GB of raw input data that is currently copied for each clone of the dataset handle. Of course I can avoid that by manually hard/soft linking the relevant files, and only later inject potential results of derived data into the original annex. However, this breaks the workflow with dataset handles.
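A hedged sketch of an approximation with present-day git-annex configuration (annex.hardlink makes git-annex hard-link objects obtained from local remotes instead of copying them; whether this plays well with the dataset-handle workflow is exactly the open question here):

# clone next to the original on the same filesystem
git clone /data/big-dataset /scratch/analysis-1
cd /scratch/analysis-1
git annex init "throw-away clone"
# hard-link objects from the local origin instead of duplicating 100GB
git config annex.hardlink true
git annex get .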

Check git annex' error reporting

For now this is intended to be a reminder to figure things out.
Just stumbled upon a case where I git annex init a repo and then got "init ok" on stdout, "fatal: ref HEAD is not a symbolic ref" on stderr and exitcode 0 (or at least None).
So, datalad didn't notice something went wrong. Needs further investigation.

"Reviewer mode"

It is often needed to provide anonymous access to data (or a subset) for peer-review. We should have (at least) some documentation on how to achieve this with datalad.

git-annex addurl should respect Content-Disposition filename

As far as I can see from a quick code grep, and from this test, it doesn't have an option to use those:

$> mkdir test; cd test; git init; git annex init;  git annex addurl --pathdepth=-1  http://human.brain-map.org/api/v2/well_known_file_download/157722290
Initialized empty Git repository in /tmp/test/.git/
init  ok
(Recording state in git...)
addurl 157722290 (downloading http://human.brain-map.org/api/v2/well_known_file_download/157722290 ...) 
/tmp/test/.git/annex/tmp/URL--http&c%%huma     [                                   <=>                                                    ]   3.07M   391KB/s   in 8.3s   
ok
(Recording state in git...)

$> ls 
157722290@

$> ~datalad/datalad/tools/urlinfo http://human.brain-map.org/api/v2/well_known_file_download/157722290
URL:  http://human.brain-map.org/api/v2/well_known_file_download/157722290
Date: Wed, 21 Jan 2015 18:04:35 GMT
Server: Apache
Content-Disposition: attachment; filename="T1.nii.gz"
Content-Transfer-Encoding: binary
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST, GET, OPTIONS
Access-Control-Max-Age: 1728000
Cache-Control: private
X-UA-Compatible: IE=Edge,chrome=1
X-Request-Id: aa0a12faabba30384fc13fc9fecda5e2
X-Runtime: 0.007028
X-Rack-Cache: miss
Status: 200 OK
Content-Type: application/octet-stream
Vary: Accept-Encoding
Content-Encoding: gzip
Connection: close
Transfer-Encoding: chunked
Set-Cookie: BIGipServerHuman_Pool=2502510346.20480.0000; path=/
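Until addurl learns this, a workaround sketch that reads the header with curl and passes the filename explicitly via --file (same URL as in the example above):

url=http://human.brain-map.org/api/v2/well_known_file_download/157722290
# extract the filename advertised in the Content-Disposition header
fn=$(curl -sI "$url" | tr -d '\r' \
     | sed -n 's/^Content-Disposition:.*filename="\([^"]*\)".*/\1/Ip')
git annex addurl --file="$fn" "$url"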

XNAT

http://xnat.org is the most widely used FOSS platform for neuroimaging data management and distribution. Used by openfmri, nitrc, humanconnectome, etc.

Pros:

  • python-xnat provides a programmable interface, so it should be relatively straightforward

TODOs:

  • no built-in facility for versioning data files

COINS

http://coins.mrn.org is a FOSS-wannabe platform used by many researchers.

TODOs

  • since it is not yet open source, we are not aware of its authentication etc. API
  • versioning of data files is absent AFAIK

Issue to be closed during #closember 2021 at the latest.

'{' is not recognized as an internal or external command

======================================================================
FAIL: Verify that all our repos are clonable
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "c:\buildslave\datalad-tests-virtualbox-dl-win7-64\build\datalad\tests\utils.py", line 349, in newfunc
    t(repo, *arg, **kw)
  File "c:\buildslave\datalad-tests-virtualbox-dl-win7-64\build\datalad\tests\utils.py", line 239, in newfunc
    t(*(arg + (filename,)), **kw)
  File "c:\buildslave\datalad-tests-virtualbox-dl-win7-64\build\datalad\tests\test_testrepos.py", line 38, in test_clone
    eq_(status, 0, msg="Status: %d  Output was: %r" % (status, output))
AssertionError: Status: 1  Output was: "'{' is not recognized as an internal or external command,\noperable program or batch file.

Message seems to be way off for figuring out WTF. Reference:
http://smaug.datalad.org:8020/builders/datalad-tests-virtualbox-dl-win7-64/builds/3/steps/nosetests/logs/stdio

Check with git folks if pre-add hook is feasible and if so -- do it

That could allow use of largefiles or whatever else to simplify interactions with annex -- files would just automagically be added to the annex (instead of the index) in a 'pre-add' hook, and/or maybe the user would be alerted that he is trying to commit a large file directly to git while it does not match the largefiles selection, ...
Or am I dreaming @joeyh? ;-)
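For the largefiles part specifically, git-annex's annex.largefiles gitattribute already gets close to the "automagic" routing when files are added via git annex add; a sketch (the 100kb threshold and the *.json exception are arbitrary examples):

# route big files to the annex, keep small JSON sidecars in git
cat >> .gitattributes <<'EOF'
* annex.largefiles=(largerthan=100kb)
*.json annex.largefiles=nothing
EOF
git annex add .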

allow to drop "old" files (e.g. not referenced after commit X)

I wondered if there is already a convenience construct which would allow dropping (maybe forcefully) some files which are no longer "of interest".

Use case: carrying out an analysis while keeping results under git-annex control. With reiterations of the analyses, a pile of results accumulates, but maybe the majority of them are not even worth keeping at all (not just locally). It would be nice to have the ability to drop all the content which, e.g., is not referenced by any commit after X (in a given branch; the situation with multiple branches might be trickier).
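The closest existing construct is probably git-annex's unused machinery; a sketch (note that it keys on content no longer referenced by any current branch, rather than the "not referenced after commit X" semantics asked for here):

# list annexed content no longer referenced by the working tree or any branch/tag
git annex unused
# inspect the report, then drop some or all of it
git annex dropunused 1-10
git annex dropunused all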

slow basic FS operations on large git-annex repositories .git/annex'es due to KEY/ directories

Echoing the discussion we are having in #17, I wanted to also bring up one about non-direct (i.e. regular/classical) annexes and their impact. As http://git-annex.branchable.com/internals/ and http://git-annex.branchable.com/internals/lockdown/ outline, to prevent accidental removal of a file, every file is placed under a .git/annex/objects/aa/bb/KEY/KEY directory, where the 'w' permission is taken away from the KEY directory so the KEY file can't be removed. It indeed solves the accidental-removal issue, but

  1. it nearly doubles the inodes needed for the repository. So on large ones, the impact on the file system might be notable (on smaug I have 100mil inodes in use now :-/ could have been just 50mil? ;))
  2. operations (even a simple du or ls -R) might take MUCH longer, since leaf directories contain a single file and thus quite a heavy file-hierarchy traversal is needed.

There was a recent comment http://git-annex.branchable.com/internals/lockdown/#comment-f77526824d026f213ea98939fda9ac4c possibly on a Linux-specific way, but I wondered: why not just take the 'w' permission away from the .git/annex/objects/aa/bb directories and store the files directly under them? Yes -- for any get/drop that directory's permission would need to flip to writable, but with proper guards around it and fsck checking (or just adding some lock file which, if not removed, would signal that one of the underlying directories might have been left writable), I bet it should be feasible to make it work quite reliably without sacrificing FS meta-information real estate/performance.

allow specification of multiple downloads at once

This primarily came up as a use case with extracting content from archives. If multiple files need to be "fetched" from e.g. a tarball, it would be a costly operation to request one file at a time.

I have no clue if that is a reasonable thing to request from git-annex -- marking it as such for now. Otherwise we might work around it simply by keeping/caching extracted archives locally for some duration (of the 'get' command, or timing out), thus allowing for a simple cp/ln operation from that pool.

git-annex: custom downloaders

As we originally planned, we would need custom helpers to fetch data from e.g. image databases that neither expose their data via HTTP nor provide a universal "key store"-like facility, thus prohibiting development of dedicated special-remote handlers for them.

One way to mitigate this would be to have in git-annex some way to specify custom "downloaders" for a specific URI pattern. E.g. for the URL regexp "http.*.torrent", still use built-in wget/curl for checking content presence, but aria2c %(url)s for fetching the content. Such custom downloaders would also have their own registry of authorization credentials for data sources requiring them. We could also support extraction from locally present archives via custom downloaders; see e.g. Extract for details.

If support for custom downloaders is not implemented directly in git-annex, it might be implementable by datalad in many cases by "proxying" calls to wget and/or curl and then checking/fetching content accordingly.

Currently annex only has support for regular wget/curl and quvi (for youtube videos), with hardcoded logic.

Related:

provide decorator to swallow log messages to get them ignored or analyzed

now if we run

$> nosetests -s -v datalad/tests/                 
datalad.tests.test_annexrepo.test_AnnexRepo_instance_brand_new ... ok
datalad.tests.test_annexrepo.test_AnnexRepo_instance_from_clone ... 2015-02-25 13:19:13,261 [ERROR  ] 'git clone -v /home/yoh/proj/datalad/datalad/datalad/tests/testrepos/basic/r1 /home/yoh/.tmp/tmptoMerM' returned with exit code 128
| stderr: 'fatal: destination path '/home/yoh/.tmp/tmptoMerM' already exists and is not an empty directory.
| ' (gitrepo.py:59)
ok
...

we get them spit out to the screen, which is not proper since they are intended to error out at that point (right, @bpoldrack?), so we need to swallow them.
In fail2ban we have a derived TestCase class for that purpose: https://github.com/fail2ban/fail2ban/blob/HEAD/fail2ban/tests/utils.py#L200 which also allows analyzing those logs for testing (which we would need too). But here we had better come up with a decorator which captures the logs and exposes them within the test. Any takers? ;)

Integrate datalad into BCE

I found my way here via the software carpentry website. I don't know a better way to generically contact folks in the project, but feel free to be in touch via email (my github username @berkeley.edu).

We're working on a project to create a common baseline of packages at UC Berkeley called BCE (the Berkeley Common Environment). The tools you are building sound like exactly the direction we'd like to be going in for data management. So, "integrate" here is not a heavy demand, just making sure BCE supports the python packages, etc. that you're relying on.

Note that while we build ubuntu VMs, we don't use dpkg/apt to manage most of our python dependencies (there needs to be a strong reason to do so). We use pip. No slight intended against neurodebian, etc. - but this way you can still easily install the BCE python dependencies on, e.g., OS X. This is our current list of dependencies.

Anywho, git annex is awesome as a backend, but it's still too cumbersome to recommend to computational scientists who aren't necessarily "committed to the cause." I'd love to support your efforts to get more people using git and git annex in a sensible way for science!

It'd be great to get this integrated in BCE. Then folks could very easily learn how to use these tools using VirtualBox, or EC2, or Docker, or whatever else we end up supporting.

cc @aculich

datalad.wtf() and facility for tracking versions of external tools so code could workaround/warn accordingly

In PyMVPA we have this handy function which reports various information about the code and externals, which helps with troubleshooting. It might be nice to adopt a similar one here, but ideally probably through generalization of the mvpa2.wtf (and externals etc.) functionality into a separate project... Currently projects need to brew similar-but-different functionality for the same purpose. pandas and statsmodels have some already as well.
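For reference, later DataLad releases grew exactly such a facility as a command:

# report versions and configuration of datalad, git, git-annex and other externals
datalad wtf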

fetch complimentary materials for the PDF

I am reviewing a paper for PLOS and the annoying thing is that it has plenty of "Supplementary" files which are linked in the PDF, but even with acroread it doesn't go to any of those URLs for some reason... so here came a workaround, ugly in implementation but beautiful in its result:

git init; git annex init
strings ../PONE-D-XXXX.pdf \
  | grep 'URI.*editorialmanager' \
  | sed -e 's,.*/URI(\(http://www.edit.*\))>>.*,\1,g' \
  | while read link; do
      fn=$(~datalad/datalad/tools/urlinfo -f $link)
      git annex addurl --pathdepth=1 --file=$fn $link
    done

urlinfo was necessary because of #37.

So I thought it might provide an interesting and useful use case for the crawler, where the "provider" is a PDF document (instead of e.g. a website) -- the rest stays the same. Now I have fetched some tarballs to be extracted etc.

openfmri

https://openfmri.org is a popular NSF-funded project to federate a rich collection of neuroimaging data from cognitive experiments.

Pros:

  • standardized layout
  • redundant and heterogeneous data hosting -- XNAT, Amazon S3 (versioning was already enabled for S3)

Cons:

  • redundant and heterogeneous data hosting -- care must be taken to establish proper "original data distribution" and make use of those additional data hosters

Windows: annex has issues with links in submodules, so do not write tests for get/install in submodule repos for now :-/

http://git-annex.branchable.com/bugs/downloads_load___40__from_url__41___to_incorrect_directory_in_a_submodule/

> git clone --recurse git://github.com/yarikoptic/datalad -b nf-test-repos
...
> cd datalad/datalad/tests/testrepos/basic/r1
> ls -l test-annex.dat
lrwxrwxrwx 1 yoh yoh 186 Feb 23 15:56 test-annex.dat -> .git/annex/objects/zk/71/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b.dat/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b.dat
> git co master    
Switched to branch 'master'

$> git annex get test-annex.dat
(merging origin/git-annex into git-annex...)
(recording state in git...)
get test-annex.dat (from web...) 
../../../../../.git/modules/datalad/te 100%[=============================================================================>]       4  --.-KB/s   in 0s     
ok
(recording state in git...)

# indeed upstairs
$> ls -l ../../../../../.git/modules/datalad/tests/testrepos/modules/basic/r1/annex/objects/zk/71/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca
38c9a83f5b1dd8e5d3b.dat/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b.dat
-r-------- 1 yoh yoh 4 Feb 23 15:57 ../../../../../.git/modules/datalad/tests/testrepos/modules/basic/r1/annex/objects/zk/71/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b.dat/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b.dat

$> acpolicy git-annex
git-annex:
  Installed: 5.20150205+git57-gc05b522-1~nd80+1
  Candidate: 5.20150205+git57-gc05b522-1~nd1
  Version table:
 *** 5.20150205+git57-gc05b522-1~nd80+1 0
        500 http://neuro.debian.net/debian-devel/ jessie/main amd64 Packages
        100 /var/lib/dpkg/status

$> acpolicy git      
git:           
  Installed: 1:2.1.4-2.1
  Candidate: 1:2.1.4-2.1
  Version table:
     1:2.1.4+next.20141218-2 0
        300 http://http.debian.net/debian/ experimental/main amd64 Packages
 *** 1:2.1.4-2.1 0
        900 http://http.debian.net/debian/ jessie/main amd64 Packages
        600 http://http.debian.net/debian/ sid/main amd64 Packages
        100 /var/lib/dpkg/status

1000genome

Not neuroimaging, but a good/interesting use case (versioned files via suffixes), available via aspera or S3 (bucket without versioning; versioning was not enabled even though advised in private correspondence). Locations: http://ftp.ncbi.nlm.nih.gov/1000genomes/ and s3://1000genomes
Contains only 462,826 files (including versioned copies) on S3 atm.

parallel download of a file from multiple URLs (and special remotes?)

Just an idea: e.g. aria2c can download a file from multiple URLs at once. This can mitigate limited/congested bandwidth.

It would be nice if datalad, or even better native git-annex support, could allow/try to get/download content from multiple URLs at once (if multiple URLs are assigned).

Related discussions/posts on git-annex project pages:
"Downloading files from multiple git-annex sources simultaneously"
http://git-annex.branchable.com/todo/Bittorrent-like_features/#index1h1
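For illustration, aria2c's multi-source mode (placeholder URLs; both must point to the same file):

# fetch one file from two mirrors in parallel, splitting it into segments
aria2c -o sub-01_T1w.nii.gz \
  https://mirror-a.example.org/sub-01_T1w.nii.gz \
  https://mirror-b.example.org/sub-01_T1w.nii.gz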

Make nipype more conscious about dead symlinks

I am not exactly sure what the right solution would be. If nipype's DataGrabber/Finder discovers a dead symlink, it should at least complain/warn. But apart from that, I am not certain whether it should refuse to run if there is a dead symlink in the matched set, or whether it should exclude such files, because they are practically not present.
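In the meantime, a quick pre-flight check for dead symlinks (i.e. annexed files without locally present content) under a dataset directory:

# list broken symlinks below the current directory (GNU find)
find . -xtype l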
