datalad / datalad
Keep code, data, containers under control with git and git-annex
Home Page: http://datalad.org
License: Other
Primarily this came up as a use case with extracting load from archives: if multiple files need to be "fetched" from e.g. a tarball, it would be a costly operation to request one file at a time.
I have no clue if that is a reasonable thing to request from git-annex -- marking it as such for now. Otherwise we might work around it simply by keeping/caching extracted archives locally for some duration (of the 'get' command, or until a timeout), thus allowing a simple cp/ln operation from that pool.
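The caching idea above can be sketched with plain shell tools; everything here (file names, the cache location, the tarball) is made up purely for illustration:

```shell
# Toy demonstration of the archive-cache idea; all paths are made up.
cache="$PWD/archive-cache"              # hypothetical per-'get' cache pool
mkdir -p "$cache"

# stand-in for the source tarball with several files inside
mkdir -p src/sub01 && echo data > src/sub01/anat.nii
tar czf ds1.tar.gz src

# extract once into the cache ...
mkdir -p "$cache/ds1" && tar xzf ds1.tar.gz -C "$cache/ds1"

# ... then satisfy each requested file with a cheap hard link from the
# pool instead of re-extracting the archive per file:
mkdir -p fetched && ln "$cache/ds1/src/sub01/anat.nii" fetched/anat.nii
```

The expensive extraction happens once per archive, and every subsequent per-file "get" is a link/copy from the pool.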
Reviewing a paper for PLOS, and the annoying thing is that it has plenty of "Supplementary" files which are linked in the PDF but which, even with acroread, don't go to any of those URLs for some reason... so here came a workaround, ugly in implementation but beautiful in its result:
git init; git annex init
strings ../PONE-D-XXXX.pdf \
  | grep 'URI.*editorialmanager' \
  | sed -e 's,.*/URI(\(http://www.edit.*\))>>.*,\1,g' \
  | while read link; do
      fn=$(~datalad/datalad/tools/urlinfo -f $link)
      git annex addurl --pathdepth=1 --file=$fn $link
    done
urlinfo was necessary because of #37 .
So I thought it might provide an interesting and useful use case for the crawler, where the "provider" is a PDF document (instead of e.g. a website) -- the rest goes the same. Now I have fetched some tarballs to be extracted, etc.
A lovely annexificator already exists:
https://github.com/detrout/encode-annex
so worth considering this use-case as well
just an idea
e.g. aria2c can download a file from multiple URLs at once, which can mitigate limited/congested bandwidth.
It would be nice if datalad -- or, even better, git-annex natively -- could allow/try to download content from multiple URLs at once (if multiple URLs are assigned).
Related discussions/posts on git-annex project pages:
"Downloading files from multiple git-annex sources simultaneously"
http://git-annex.branchable.com/todo/Bittorrent-like_features/#index1h1
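A minimal sketch of what such a multi-source fetch could look like with aria2c; the mirror URLs and file name are placeholders, and the command is guarded so it is a no-op where aria2c is not installed:

```shell
# Hypothetical mirrors of the same annexed file (placeholder URLs);
# aria2c treats multiple URIs of one file as sources to segment across.
url1="http://mirror-a.example.org/sub01_bold.nii.gz"
url2="http://mirror-b.example.org/sub01_bold.nii.gz"

if command -v aria2c >/dev/null 2>&1; then
    # -o names the output file; the download is split across both mirrors.
    # "|| true" only because the example hosts are not real.
    aria2c --out=sub01_bold.nii.gz "$url1" "$url2" || true
fi
```

git-annex would still verify the key checksum afterwards, so a partially corrupted multi-source download would be caught.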
takes a while on each clone etc.
As I hinted in the mail, it should just provide a minimalistic view of that repository with only the files necessary for the (unit)test.
ATM it is just the output of wget, so the screen gets filled up with all kinds of stuff quite quickly, ruining sometimes-precious terminal history. In light of #10 it might be worth considering making the default fetching of data less verbose, with more or less just an 'Ok' or 'Fail' status being reported. I also somewhat like how docker manages to report its progress in the terminal (curses?), where multiple lines show progress and not that much of the terminal screen is wasted.
Aspera is a company (http://asperasoft.com, apparently belonging to IBM now) developing products for high-performance transfer -- it intends to fill up the pipe even while going through the WAN. Currently used by NIH (for NCBI) and by HCP.
Pros:
Cons:
Theoretically nothing forbids supporting it, but logistically it might be very messy (see cons).
> git clone --recurse git://github.com/yarikoptic/datalad -b nf-test-repos
...
> cd datalad/datalad/tests/testrepos/basic/r1
> ls -l test-annex.dat
lrwxrwxrwx 1 yoh yoh 186 Feb 23 15:56 test-annex.dat -> .git/annex/objects/zk/71/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b.dat/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b.dat
> git co master
Switched to branch 'master'
$> git annex get test-annex.dat
(merging origin/git-annex into git-annex...)
(recording state in git...)
get test-annex.dat (from web...)
../../../../../.git/modules/datalad/te 100%[=============================================================================>] 4 --.-KB/s in 0s
ok
(recording state in git...)
# indeed upstairs
$> ls -l ../../../../../.git/modules/datalad/tests/testrepos/modules/basic/r1/annex/objects/zk/71/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b.dat/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b.dat
-r-------- 1 yoh yoh 4 Feb 23 15:57 ../../../../../.git/modules/datalad/tests/testrepos/modules/basic/r1/annex/objects/zk/71/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b.dat/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b.dat
$> acpolicy git-annex
git-annex:
Installed: 5.20150205+git57-gc05b522-1~nd80+1
Candidate: 5.20150205+git57-gc05b522-1~nd1
Version table:
*** 5.20150205+git57-gc05b522-1~nd80+1 0
500 http://neuro.debian.net/debian-devel/ jessie/main amd64 Packages
100 /var/lib/dpkg/status
$> acpolicy git
git:
Installed: 1:2.1.4-2.1
Candidate: 1:2.1.4-2.1
Version table:
1:2.1.4+next.20141218-2 0
300 http://http.debian.net/debian/ experimental/main amd64 Packages
*** 1:2.1.4-2.1 0
900 http://http.debian.net/debian/ jessie/main amd64 Packages
600 http://http.debian.net/debian/ sid/main amd64 Packages
100 /var/lib/dpkg/status
I found my way here via the software carpentry website. I don't know a better way to generically contact folks in the project, but feel free to be in touch via email (my github username @berkeley.edu).
We're working on a project to create a common baseline of packages at UC Berkeley called BCE (the Berkeley Common Environment). The tools you are building sound like exactly the direction we'd like to be going in for data management. So, "integrate" here is not a heavy demand, just making sure BCE supports the python packages, etc. that you're relying on.
Note that while we build ubuntu VMs, we don't use dpkg/apt to manage most of our python dependencies (there needs to be a strong reason to do so). We use pip. No slight intended against neurodebian, etc. - but this way you can still easily install the BCE python dependencies on, e.g., OS X. This is our current list of dependencies.
Anywho, git annex is awesome as a backend, but it's still too cumbersome to recommend to computational scientists who aren't necessarily "committed to the cause." I'd love to support your efforts to get more people using git and git annex in a sensible way for science!
It'd be great to get this integrated in BCE. Then folks could very easily learn how to use these tools using VirtualBox, or EC2, or Docker, or whatever else we end up supporting.
cc @aculich
as discovered with psychoinformatics-de/studyforrest-www#5
originally it had read/write for owner/group, and was pushed to a public server where that group was missing
I guess ideally git-annex could have an option for "public" publishing of the annex?
As HCP and NCANDA (via @nicholsn, with 10M files) show, it might be infeasible to aim at supporting the entire beast shipped as a single git/git-annex repository, which suggests modularizing at least at the level of a single subject (or maybe subject/modality or subject/visit), to be managed with e.g. git submodules or some "native" datalad distribution mechanism. But we don't want to lose features such as git-annex "views", since they provide really nice functionality to re-expose that data in an analysis-specific, convenient layout (given that the metadata is contained within annex). So this all requires additional thought on how to go about it.
@joeyh -- any ideas on how to go about growing large repositories partitioning/handling?
With pluggable comparison tools (similar to how git does it)... reviving an old dialog from Jan 2014 which finished on [email protected]
See http://newsoffice.mit.edu/2014/whos-using-your-data-httpa-0613
Something to know about/keep in mind
as far as I can see from a quick code grep, or from this test -- it doesn't have an option to use those:
$> mkdir test; cd test; git init; git annex init; git annex addurl --pathdepth=-1 http://human.brain-map.org/api/v2/well_known_file_download/157722290
Initialized empty Git repository in /tmp/test/.git/
init ok
(Recording state in git...)
addurl 157722290 (downloading http://human.brain-map.org/api/v2/well_known_file_download/157722290 ...)
/tmp/test/.git/annex/tmp/URL--http&c%%huma [ <=> ] 3.07M 391KB/s in 8.3s
ok
(Recording state in git...)
$> ls
157722290@
$> ~datalad/datalad/tools/urlinfo http://human.brain-map.org/api/v2/well_known_file_download/157722290
URL: http://human.brain-map.org/api/v2/well_known_file_download/157722290
Date: Wed, 21 Jan 2015 18:04:35 GMT
Server: Apache
Content-Disposition: attachment; filename="T1.nii.gz"
Content-Transfer-Encoding: binary
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST, GET, OPTIONS
Access-Control-Max-Age: 1728000
Cache-Control: private
X-UA-Compatible: IE=Edge,chrome=1
X-Request-Id: aa0a12faabba30384fc13fc9fecda5e2
X-Runtime: 0.007028
X-Rack-Cache: miss
Status: 200 OK
Content-Type: application/octet-stream
Vary: Accept-Encoding
Content-Encoding: gzip
Connection: close
Transfer-Encoding: chunked
Set-Cookie: BIGipServerHuman_Pool=2502510346.20480.0000; path=/
In PyMVPA we have this handy function which reports various information about code and externals, which helps with troubleshooting. It might be nice to adopt a similar one here, but ideally through generalization of the mvpa2.wtf (and externals etc.) functionality into a separate project... Currently projects need to brew similar-but-different functionality for the same purpose; pandas and statsmodels have some already as well.
One of the problems many labs might have is limited (if any) offsite backup of their data. Given large amounts of processed etc. data, it might be worth having a mode where we "publish" repositories to e.g. an external hard drive, copying only "precious" data (e.g. original raw/preprocessed). git-annex has the notion of "preferred content", but it is assigned per "source" repository, whereas we would need something like "preferred content to publish to X", or to just declare some files/directories "precious" (maybe just a tag). Then it would be useful to allow incremental updates of backups on the external drive, stating e.g. "publish --to=/media/drive --tags=precious".
@joeyh - what would be your thoughts on how to wrap this up? tags?
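One possible spelling of this with existing git-annex commands could use the metadata facility plus a plain-directory special remote. This is a hedged sketch: the "publish --to=... --tags=..." UI itself does not exist, and the paths and the remote name "backup" are made up. It is guarded so it is a no-op outside an annex repository:

```shell
drive=/media/drive   # made-up mount point of the external disk

if git annex version >/dev/null 2>&1 && git annex info --fast >/dev/null 2>&1; then
    # declare some content "precious" via git-annex metadata (a real command)
    git annex metadata --set tag=precious data/raw/subject01.dat || true

    # a plain-directory special remote living on the external drive
    git annex initremote backup type=directory directory="$drive" encryption=none || true

    # incremental backup: copy only the tagged content
    git annex copy --to=backup --metadata tag=precious || true
fi
```

Repeating the final copy on later runs would only transfer content not yet present on the drive, giving the incremental behavior described above.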
That could allow the use of largefiles or whatever else to simplify interactions with annex -- matching files would just be automagically added to annex (instead of the index) in a 'pre-add' hook, and/or the user could be alerted that he is trying to commit a large file directly to git while it does not match the largefiles selection, ...
Or am I dreaming, @joeyh? ;-)
Imagine an annex repository which had some files' content fetched (via 'annex get'). Then modifications to some of those (and maybe other) files were done in the origin repository, so if a plain "git pull" is performed, symlinks to the new content would be broken and would require a manual "git annex get" on those files which were previously "get"ed at some point in the past. AFAIK (and from our discussion on IRC, cited below -- thanks patagonicus and scalability-junk for the discussion) there is currently no way to achieve this without an additional script on top, in two possible ways:
FWIW -- here is a protocol of IRC session
10:05 yoh: is there a way to 'upgrade' the content which was 'got' already? e.g. I have some files which got their
content delivered to local annex. then I want to 'git pull' + 'git annex get
those_files_for_which_I_had_content'
10:07 patagonicus: yoh: Probably git annex merge, if pull already got all the branches it needs. No need to do an
extra get, if the data was already copy --to the repo (merge will update your master branch so
that the new file's symlinks appear and the link target will already be there)
10:08 warp: just do git annex sync?
10:09 scalability-junk: yeah not sure why someone would manually git pull and push with git annex sync.
10:09 bremner: sync either gets all content or none
10:09 scalability-junk: bremner: nope it gets all metadata aka branches and git stuff
10:09 scalability-junk: content is only synced with git annex sync --content
10:09 bremner: yes, and then it syncs all content, like I just said
10:09 scalability-junk: and even then it's not all content, but the content which is preferrer or has not
enough duplicates
10:10 patagonicus: bremner: No, it's based on preferred content
10:10 scalability-junk: *preferred
10:10 bremner: ok, fine. It _still_ doesn't answer yoh's question.
10:10 patagonicus: But sync does push/pull all branches, so if you want to only sync some. And if you want
to rewrite history you'll probably have to fetch/push --force and do some resets.
10:11 scalability-junk: rewriting history is a pain
10:11 yoh: keep also in mind that sync tries to be bidirectional -- I want a clean one-way.
10:13 scalability-junk: Wasn't said. Alright so yeah git merge probably
10:18 yoh: checked -- nope -- merge didn't get it and there is no --content for it
10:19 patagonicus: yoh: Are the symlinks there? And you already pushed that file with git annex copy --to
from a different repo?
10:19 yoh: I am just trying with two local repos I made for testing... there is no need to copy --to
10:20 patagonicus: Eh … then how was the content "'got' already"?
10:20 yoh: ok - 1 sec
10:23 yoh: eh -- history a bit messy to share... s215bLgyFs
10:23 yoh: http://slexy.org/view/s215bLgyFs
10:23 yoh: so this way ;) after initial clone I 'get' some interesting file. then if they get modified
in origin, I would like to "update" them locally as well, but only them
10:24 yoh: merge itself doesn't even pull... may be I should fetch before and then merge,... let's see
10:24 patagonicus: Yeah, merge does not do any fetch/pull/push. It just uses the synced/* branches that
are available to update master and git-annex
10:25 patagonicus: You have no line that would transfer the "123" content from d1 to d2, so it's not there.
10:26 patagonicus: You'll have to run a git annex get afterwards. Or add d2 as a remote to d1 and run git
annex copy --to=d2 at any time.
10:26 yoh: I understand that... and that is what I would like to achieve -- that some command does that
content transfer for those files which had local content before
10:26 patagonicus: So, basically git annex sync --content without pushing to the remote?
10:26 yoh: sync would sync/get all the datafiles
10:27 yoh: I would like only those present on client already (server shouldn't be aware of the client, so
no copy --to)
10:28 patagonicus: What do you mean with present? Based on the file names or on the content? Because in
your example you have 1 file but two contents (so two different keys in annex' storage.
One content will be present, one (the 123 one) will not).
10:29 yoh: on the file names
10:30 patagonicus: Oh. That's going to be complicated, I think. So if there's two files, A and B, and the
client has the content of A and both A and B get changed on the server, only the new
content for A should be made available on the client?
10:30 yoh: yes
10:31 yoh: I guess could also be done via retrospection in git/git-annex
10:33 patagonicus: There's no such feature built into annex at the moment (and I'm not sure if it ever
will be added). Basically what you want to do is: run git annex find (which will list
all files currently in the repo, probably with --print0), run git annex merge to update
master, the run git annex get with the result of the find you previously did. That
wouldn't catch file renaming, but for the basic case
10:34 patagonicus: it should work. Needs N*M space, though, where N is the number of files in the local
repo and M is the average file name length.
10:34 patagonicus: I think git annex find --print0 >here && git annex fetch -a && git annex merge && xargs
-0 git annex get <here or something like that should work.
10:35 patagonicus: Will break if there's ever a merge in which file content was changed that is done
without the find/get pair.
10:35 yoh: yeap, something like that
10:36 yoh: because of such corner-cases I think it might be better to do it (optionally?) via full
retrospection -- if each file without content ever had content before and was not explicitly
dropped
10:37 patagonicus: Then the runtime would depend on the number of commits on master (times the number of files).
10:37 patagonicus: Should be doable with a bit of scripting, though.
10:38 patagonicus: The "explicetly" dropped would be harder. Basically you could only find out if any of
the previous versions of a file is still in the local repo. However, if you keep
running git annex unused && git annex dropunused all those would never be in the local
repo.
10:39 yoh: patagonicus: yes, I know. But such full retrospection could then be optional for explicit run
and/or called if manual operation was detected... may be there could even be some record of a
last state when things were "in order" to start from that point
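The find/merge/get recipe patagonicus sketches in the log above can be written out as a script. This is an assumption-laden sketch (remote name, temp file handling), guarded so it does nothing outside an annex repository; as noted in the discussion, it does not handle renames or content changed in merges done without the find/get pair:

```shell
state_file=$(mktemp)   # will hold the list of files that currently have content

if git annex version >/dev/null 2>&1 && git annex info --fast >/dev/null 2>&1; then
    # 1. remember which files have content present locally right now
    git annex find --print0 > "$state_file"

    # 2. bring in new history without transferring any content
    git fetch origin || true
    git annex merge            # updates master and the git-annex branch

    # 3. re-fetch content only for the files that previously had it
    xargs -0 git annex get < "$state_file"
fi
```

As discussed, this needs space proportional to the number of files times the average file-name length, and a retrospection-based variant would instead walk the history of master.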
I wondered if there is already a convenience construct which would allow dropping (maybe forcefully) files which are no longer "of interest".
Use case: carrying out an analysis while keeping results under git-annex control. With reiterations of the analyses, a pile of results accumulates, and maybe the majority of them are not worth keeping at all (not just locally). It would be nice to have the ability to drop all the load which e.g. is not referenced by any commit after X (in a given branch; the situation with multiple branches might be trickier).
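The closest approximation with existing commands seems to be the unused/dropunused machinery: content no longer referenced by any branch or tag becomes "unused" and can then be dropped, though an "after commit X" selector would still be missing. A hedged sketch, guarded so it is a no-op outside an annex repository:

```shell
if git annex version >/dev/null 2>&1 && git annex info --fast >/dev/null 2>&1; then
    # narrow what counts as "used" to a single branch (annex.used-refspec
    # is a real config knob; the value here is just an example)
    git config annex.used-refspec "+refs/heads/master"

    git annex unused                # list keys not referenced by any used ref
    git annex dropunused all        # drop all of them locally
    # git annex dropunused --from=origin all   # ... or from a remote as well
fi
```

This drops everything unreferenced rather than only content superseded after a chosen commit, so the "after X" refinement would still need scripting on top.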
======================================================================
FAIL: Verify that all our repos are clonable
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\nose\case.py", line 197, in runTest
self.test(*self.arg)
File "c:\buildslave\datalad-tests-virtualbox-dl-win7-64\build\datalad\tests\utils.py", line 349, in newfunc
t(repo, *arg, **kw)
File "c:\buildslave\datalad-tests-virtualbox-dl-win7-64\build\datalad\tests\utils.py", line 239, in newfunc
t(*(arg + (filename,)), **kw)
File "c:\buildslave\datalad-tests-virtualbox-dl-win7-64\build\datalad\tests\test_testrepos.py", line 38, in test_clone
eq_(status, 0, msg="Status: %d Output was: %r" % (status, output))
AssertionError: Status: 1 Output was: "'{' is not recognized as an internal or external command,\noperable program or batch file.
Message seems to be way off for figuring out WTF. Reference:
http://smaug.datalad.org:8020/builders/datalad-tests-virtualbox-dl-win7-64/builds/3/steps/nosetests/logs/stdio
http://xnat.org is the most widely used FOSS platform for neuroimaging data management and distribution. Used by openfmri, nitrc, humanconnectome, etc.
Pros:
TODOs:
http://coins.mrn.org is a FOSS-wannabe platform used by many researchers.
TODOs
Issue to be closed during #closember 2021 at the latest.
All recent builds seem to get stuck at
https://travis-ci.org/datalad/datalad/builds/37287440#L195
2014-10-07 13:25:49,086 [DEBUG ] Running: cd /tmp/tmpDMOIrj && git annex add -c annex.alwayscommit=false "files/1" (cmd.py:100)
and since we are 'sponging' all output before we can see it, adding --debug to the git annex call didn't help provide more information....
any ideas on how to troubleshoot this one @datalad/developers ?
This is just a placeholder to point to the problem of failing tests (such as #67) on Windows. Some of them are due to max path length limit (#58) but there might be more. Just requires a thorough pass
Reference: http://smaug.datalad.org:8020/builders/datalad-tests-virtualbox-dl-win7-64/builds/3/steps/nosetests/logs/stdio
now it is USER@HOST:PATH
I am not exactly sure what the right solution would be. If nipype's DataGrabber/Finder discovers a dead symlink it should at least complain/warn. But apart from that I am not certain whether it should refuse to run if in the matched set is a dead symlink, or whether it should exclude them, because they are practically not present.
Just a topic for possible discussion etc
Just for the sake of seeing how big a really big dataset would get, I am "simulating" an annex repository with all the files distributed by the HCP 500-subject release, which, as deposited to S3, has >5,600,000 files. It has been running for a while now ;)
Some of the initial bottlenecks I have detected:
The script I am using is here: https://github.com/datalad/datalad/blob/master/tools/mimic_repo (really not sure why I did it in Python instead of a simple bash script). And here are the corresponding 's3cmd ls' outputs I am running on:
http://www.onerussian.com/tmp/hcp-ls.20141020-500subject-only.txt.gz
or only just a 1000 (to give a try)
http://www.onerussian.com/tmp/hcp-ls.20141020-500subject-only1000.txt.gz
so you could run e.g. with
./mimic_repo hcp-ls.20141020-500subject-only1000.txt.gz /tmp/testtt
Provides portal to upload software/data with versioning etc. See
https://zenodo.org/search?p=980__a:dataset
so it might be useful to keep in mind, at least for "crawling" and exposing via annex -- and ideally for publishing to ;)
During SfN, Jeff Teeters shared good news on the availability of alternative methods of downloading datasets from crcns.org which should support versioning etc.; research in more detail:
http://crcns.org/download
probably by taking **kwargs and pop'ing only a limited set of arguments, then passing *args and the rest of **kwargs deeper inside
test also by nesting multiple with_tempfile's
when http://github.com/datalad/brainfacts--2012-edition was created, the index for the git-annex branch with all the urls was left uncommitted
not neuroimaging, but a good/interesting use case (versioned files via suffixes), available via aspera or S3 (a bucket without versioning; versioning was not enabled even though advised in private correspondence). Locations: http://ftp.ncbi.nlm.nih.gov/1000genomes/ and s3://1000genomes
Contains only 462826 files (including versioned copies) on S3 atm.
A really wild question: to provide access to S3 buckets not exposed publicly (thus requiring authentication via IAM credentials, such as HCP), and through prefixes/revision ids instead of just a pure keystore (as in the native git-annex S3 special remote), we would need to provide an external special remote which would again need to implement S3 client capabilities for basic authentication/fetching. It can be done, but we would grow dependencies implementing the same functionality that git-annex already has inside.
So a wild thought came: maybe git-annex could somehow expose its S3 functionality to e.g. external special remotes through some API? ;) It could well be made into something even more generic: git-annex already has capabilities for talking to a plethora of data hosting providers and thus could maybe serve as a glorified "ultimate downloader" -- or, say, expose itself as yet another special remote capable of TRANSFER.
As we originally planned, we would need custom helpers to fetch data from e.g. image databases which neither expose their data via HTTP nor provide a universal "key store"-like facility, prohibiting development of dedicated special-remote handlers for them.
One way to mitigate this would be to have in git-annex some way to specify custom "downloaders" for a given URI pattern, e.g. for the URL regexp "http.*.torrent" still use the built-in wget/curl for checking content presence but aria2c %(url)s
for fetching the content. Such custom downloaders would also have their own registry of authorization credentials for data sources requiring them. We could also support extraction from locally present archives via custom downloaders, see e.g.
Line 277 in 54b71be
If support for custom downloaders is not implemented directly in git-annex, it might be implementable in datalad in many cases by "proxying" calls to wget and/or curl and then checking/fetching content accordingly.
Currently annex only has support for regular wget/curl and quvi (for youtube videos), with hardcoded logic.
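The "proxying" fallback mentioned above could look roughly like the following: a small wget wrapper placed earlier in PATH that dispatches on the URL pattern. All names here (the wrapper file name, aria2c as the torrent handler, the real wget's path) are illustrative assumptions, not an existing datalad mechanism:

```shell
# Write a wget stand-in that hands .torrent URLs to a custom downloader
# and passes everything else through to the real wget.
cat > wget-proxy <<'EOF'
#!/bin/sh
case "$*" in
    *.torrent*) exec aria2c "$@" ;;        # custom downloader for torrents
    *)          exec /usr/bin/wget "$@" ;; # everything else: the real wget
esac
EOF
chmod +x wget-proxy
```

Prepending the wrapper's directory to PATH before invoking git-annex would then route annex's wget calls through the dispatcher.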
Related:
http://www.cs.cmu.edu/afs/cs/project/theo-73/www/science2008/data.html
a collection of .mat files + scripts in .tar.gz
just thought it might be nice to redistribute it, thus decided to file it here
For now this is intended as a reminder to figure things out.
Just stumbled upon a case where I 'git annex init' a repo and then got "init ok" on stdout, "fatal: ref HEAD is not a symbolic ref" on stderr, and exit code 0 (or at least None).
So, datalad didn't notice something went wrong. Needs further investigation.
https://openfmri.org is a popular NSF-funded project to federate a rich collection of neuroimaging data from cognitive experiments.
Pros:
Cons:
Use case, again not neuroimaging but interesting, brought up by Don Armstrong
http://www.ncbi.nlm.nih.gov/sra
sra-toolkit providing tools for conversion and visualization is in Debian
Initial use was to link load from incoming to public repositories...
to mitigate limitation on the maximum path length (up to 260 chars total)
Echoing the discussion we are having in #17, I wanted to also bring up one about non-direct (i.e. regular/classical) annexes and their impact. As http://git-annex.branchable.com/internals/ and http://git-annex.branchable.com/internals/lockdown/ outline, to prevent accidental removal of a file, every file is placed under a .git/annex/objects/aa/bb/KEY/KEY
directory, where the KEY directory has the 'w' permission taken away so the KEY file can't be removed. It indeed solves the accidental-removal issue, but
There was a recent comment http://git-annex.branchable.com/internals/lockdown/#comment-f77526824d026f213ea98939fda9ac4c possibly on a linux-specific way, but I wondered: why not just take the 'w' permission away from the .git/annex/objects/aa/bb
directories and store the files directly under them? Yes, for any get/drop that directory's permission would need to flip to writable, but with proper guards around it, and fsck checking (or just adding some lock file which, if not removed, would signal that one of the underlying directories might have been left writable), I bet it should be feasible to make it work quite reliably without sacrificing FS metadata real estate/performance?
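The alternative layout proposed here can be modeled with a throwaway directory; the names below mimic the annex object tree but are purely illustrative:

```shell
# Toy model: protect keys by removing 'w' from the hashed parent
# directory rather than from a per-key subdirectory.
mkdir -p objects/aa/bb
echo content > objects/aa/bb/KEY
chmod a-w objects/aa/bb    # non-root users can no longer unlink or add files here

# a guarded 'get'/'drop' would briefly flip the bit back ...
chmod u+w objects/aa/bb
touch objects/aa/bb/KEY2   # ... so modifications succeed again
# (annex would then re-run "chmod a-w objects/aa/bb" to re-lock; left
# writable here only so the demo directory can be cleaned up)
```

This saves one directory inode per key compared to the current KEY/KEY scheme, at the cost of the guard/fsck machinery discussed above.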
Now that annex has built-in support for torrents (http://git-annex.branchable.com/devblog/day_239-240__bittorrent_remote/) we might consider exposing torrents from http://academictorrents.com/browse.php?cat=6 -- Chris's test-retest and a few other related datasets (http://academictorrents.com/browse.php?search=fmri&c6=1) are available there
If I have a big dataset and I want to do a number of analyses with multiple users in a shared computing environment, it would be nice to be able to do that in a way that has minimal impact on storage demands.
In my concrete case, I have about 100GB of raw input data that is currently copied for each clone of the dataset handle. Of course I can avoid that by manually hard/soft linking the relevant files and only later injecting potential results or derived data into the original annex. However, this breaks the workflow with dataset handles.
Not really an issue, just a comment on your top-level readme:
It is currently in a "prototype" state, i.e. a mess.
that's classic!
NB: I thought I had filed it here, but it must have been elsewhere; a backlink will be provided later
git annex view
provides great functionality for taking advantage of metadata (tags) associated with data files to build custom views. One stumbling point might often be the "installation" of large datasets where only a handful of files are actually needed/used. ATM it results in a directory hierarchy where possibly the majority of files are broken links, which makes navigation difficult and unproductive.
It would be great if there were a way to generate a "lean" view where only files with content available are visible.
Somewhat related would be #6, i.e. carrying out an update while maintaining a lean view
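An approximation with existing commands: plain "git annex find" lists only files whose content is present, so a lean tree can be mimicked by symlinking just those files aside. "lean-view" is a made-up name, the loop assumes bash (for read -d ''), and the whole thing is guarded so it is a no-op outside an annex repository:

```shell
mkdir -p lean-view

if git annex version >/dev/null 2>&1 && git annex info --fast >/dev/null 2>&1; then
    # for each file with content present, mirror it into lean-view/
    git annex find --print0 | while IFS= read -r -d '' f; do
        mkdir -p "lean-view/$(dirname "$f")"
        ln -s "$PWD/$f" "lean-view/$f"
    done
fi
```

A built-in "lean" mode of git annex view could presumably do the same filtering natively, and keep the view updated as content arrives or is dropped.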
It is often needed to provide anonymous access to data (or a subset) for peer-review. We should have (at least) some documentation on how to achieve this with datalad.
now if we run
$> nosetests -s -v datalad/tests/
datalad.tests.test_annexrepo.test_AnnexRepo_instance_brand_new ... ok
datalad.tests.test_annexrepo.test_AnnexRepo_instance_from_clone ... 2015-02-25 13:19:13,261 [ERROR ] 'git clone -v /home/yoh/proj/datalad/datalad/datalad/tests/testrepos/basic/r1 /home/yoh/.tmp/tmptoMerM' returned with exit code 128
| stderr: 'fatal: destination path '/home/yoh/.tmp/tmptoMerM' already exists and is not an empty directory.
| ' (gitrepo.py:59)
ok
...
we get them spit out to the screen, which is not proper since they are intended to error out at that point (right, @bpoldrack?), so we need to swallow them.
In fail2ban we have a derived TestCase class for that purpose: https://github.com/fail2ban/fail2ban/blob/HEAD/fail2ban/tests/utils.py#L200 which also allows analyzing those logs for testing (which we would need here). But here we'd better come up with a decorator which captures the logs and exposes them within the test. Any takers? ;)