datalad / datalad-metalad Goto Github PK

View Code? Open in Web Editor NEW

11.0 9.0 11.0 1.4 MB

Next generation metadata handling

License: Other

Makefile 0.19% Python 99.44% Batchfile 0.01% Shell 0.36%

closember metadata datalad

datalad-metalad's People

Stargazers

Watchers

Forkers

bpoldrack brainlife adswa christian-monch kyleam jsheunis standardgalactic yarikoptic mslw candleindark asmacdo

datalad-metalad's Issues

Ontology for filesystem related concept

This is the best I could find so far: http://oscaf.sourceforge.net/nfo.html

Anything better available?

aggregate-metadata --update-mode all --incremental labs didn't pick up new metadata from labs

What is the problem?

labs/ had an uninstalled dataset... I've installed it, aggregated within that dataset (recursive), then within labs with --incremental, then within the top one pointing to labs and also --incremental. It did save changes to labs/ but didn't aggregate (include) any new metadata from labs

FWIW, "cpu" time spent was reported to be 13sec, while wall clock over 3min. I guess due to all the "git status" checking on the mighty superdataset

Update acknowledgements

(Config) option to skip/augment metadata (re)extraction if some content is not present etc

For many datasets we provide from http://datasets.datalad.org we do keep original data locally so I could simply re-extract/aggregate when extractors change. But for many we do not keep all the data since it is impossible/expensive. Ideally it is desired to be able to say

skip (re)extraction if some files are missing content locally. Aggregation could remain to the generic logic thus possibly sticking to the "old" state of metadata
get the data first and then run extraction. (Optionally?) drop the data after done
reextract using only extractors (well, annex only ATM) which do not need content, which we mark with NEEDS_CONTENT. It should only be in effect when it is reextraction, i.e. when metadata for that state was already extracted before. Since ATM it is of use only for annex extractor, not sure how useful it would be in general .
a bit relaxed above, may be we could allow for reextraction only for a specified list of extractors? (e.g. if I want to extract only annex extractor metadata)
incremental-extraction . Currently the "repository state" includes all the dates in annex metadata into its estimation. Not sure if that is more of a hurdle than of a benefit since now any change in annex metadata state (which wouldn't need content to extract) would demand other extractors to get data to be able to reextract/aggregate metadata without breaking consistency. In principle though, we could become smart and analyze the diff between previous/current states and skip extractors if only the files for which they had no metadata at all were modified/removed. Could be yet another mode of operation I guess (or built in into default logic)

Metadata extractor duties wishy-whashy

The datalad_core extractor implementation carries a conditional section for annex repos that reports git annex whereis info, and for plain git repos it yields empty property dicts. I think I remember how it got there, but it really doesn't make any sense.

This functionality needs to move into the annex extractor, the is only enabled for annex repos.

The rest of the extractor will take care of the structure outlined in #51

aggregate-metadata should not reaggregate if superds' extracts on subds are "up to date"

What is the problem?

Follow up to #173, for now without details - just making a formal record from the memory of the experience.

In such case where subdataset actually doesn't contain any aggregated metadata, we shouldn't demand reaggregation upon not seeing a matching metadata extracted (ds-, cn-) files, whenever the state of the subdataset is still consistent with what it knows and subdataset is installed.
Might relate to my recent changes in datalad/datalad#3007

Capture information to be compatable with HCLS Dataset Descriptor

Would it be possible to use the metadata in datalad to generate a W3C Dataset Descriptor?

Here is a link to the spec preview

(Optionally?) .xz ds- files as well

Use case - HCP: datalad-datasets/human-connectome-project-openaccess#7, where majority of metadata comes from ds- files which in current datalad metadata handling way aren't compressed. May be we could also use .gitattributes to assign configuration per file(s) pattern on what to compress or not.
Not sure if metalad's approach to them is different, so may be this issue is not pertinent, wanted to ask.

"directory" level for Metadata?

What is the problem?

Some datasets potentially aren't getting split into proper subdatasets for one reason (distributed as a single tarball with all those subdirectories) or another ("didn't think about it"). E.g. we have http://datasets.datalad.org/?dir=/dicoms/rosetta/ where theoretically subdasets could be at the 2nd level of subdirectories to correspond to a sample from a particular scanner. When we do queries/reporting we can only report on a file or dataset and not per directory.
I wondered if it it could be somehow allowed for some directories to provide their own level of aggregation (indicated e.g. by having .datalad/ subdir, possibly with .datalad/config which would prescribe extractors)
Note: that a dataset is also a directory ;)

Where is the code with the dataset catalog generator demo?

It is here datalad/datalad-revolution#76

(mainly a note to self)

aggregate-metadata --incremental? (AKA --inner-since)

What is the problem?

Relevant:

datalad/datalad#850 --since which is largely to figure out which datasets to perform metadata-aggregation on since they have changed
datalad/datalad#1902 --skip-aggregated is in the same vain again at the level of datasets

We talked about it before (and in the recent proposal) but I think we there is no dedicated issue for the discussion.
Now that I am fetching a few TBs of data for datasets for which I have dropped that data content since we cannot actually keep all of it locally all the time (wouldn't scale), I saw the need for metadata aggregation to become "incremental" in default mode of aggregation. Not only that with --since we could skip reaggregation of datasets which didn't change since the last aggregation, but similar analysis should be done on per-file level considering the diff between the revisions. In general it could end up being a part of the --since operation within the dataset (hence --inner-since in the title ;-)), but I wanted to summarize it in a different issue since its implementation is trickier.

So it would be nice if extractors could be aware of the need to reaggregate all of the data (extractor-specific version within the extracted metadata file? ;-) ), and skip those files which were already aggregated and for which content might no longer be locally available. We could add a config or an options to get the content needed to get metadata extracted/updated (most likely that one would be locally available since re-aggregation would be happening shortly after new content added)

force reaggregation.

Support for incremental operation would be trivial for some extractors (all the file-based nifti1 etc) and not as trivial for the ones considering the files layout to be a part of the metadata (bids).

That potentially could make (re)aggregation to become feasible with each commit

Additional points which came to mind:

git-annex metadata lives in separate git-annex branch which would be constantly changing while data is get/drop'ed. We need to record the state (hexsha) of git-annex branch when metadata is aggregated for later analysis of metadata differences from the last aggregation point (i think this is not yet done)

Allow for assigning any metadata extractor to dedicated 'collection' (extract file)

In general extracted metadata ends up in two XZ files per dataset, one for the dataset-global metadata and one for per-file metadata. However, there can be a need to extract metadata for different purposes and audiences. For example, a dedicated set of metadata extractors might pull out sensitive information that must only be made available to a closed audience. Or vice versa, for certain datasets pretty much any standard metadata extractor would leak sensitive information (e.g. filenames) and only a single certified extractor implementation shall be white-listed for public exposure.

(Keep in mind that while metadata typically comes with a dataset, ie. filenames/history are available anyways, metadata can easily be separated from a dataset and live in some other DB -- in such a case, even "blacklisting" or separating DataLad's own core metadata extractor can be meaningful or necessary).

A technical approach to this problem would be to add a configuration option that puts an extractor into a specific category, and all extracted metadata for a given category goes into another separate XZ file pair, with the category name being incorporated into the filename.

With this setup, git-annex "wanted" configuration can be added that puts annex metadata extracts on the desired siblings/remotes only whenever a dataset is published. Moreover, any such remote can also be used with data encryption, thereby adding another layer of confidence.

Would be good to chat with @jbpoline about this.

do not bother recording only the version if no other changes?

What is the problem?

after datalad aggregate-metadata --force-extraction to check if fixes in datalad/datalad-neuroimaging#39 did not introduce any other changes, found that the commit done was just

$> git show
commit 0233522bb00500110d7b56e99b498a50704b1a11 (HEAD -> master)
Author: Yaroslav Halchenko <[email protected]>
Date:   Fri Jul 27 22:01:39 2018 -0400

    [DATALAD] Dataset aggregate metadata update

diff --git a/.datalad/metadata/aggregate_v1.json b/.datalad/metadata/aggregate_v1.json
index cd57b226..8abb4d2f 100644
--- a/.datalad/metadata/aggregate_v1.json
+++ b/.datalad/metadata/aggregate_v1.json
@@ -4,7 +4,7 @@
 "content_info":
 "objects/20/cn-e571103a66837f7aa9178aadd47115.xz",
 "datalad_version":
-"0.10.0.rc5.dev18",
+"0.10.2.dev22",
 "dataset_info":
 "objects/20/ds-e571103a66837f7aa9178aadd47115",
 "extractors":

Although I could see how it might potentially be useful, I think it would be more beneficial to not breed such commits

What steps will reproduce the problem?

What version of DataLad are you using (run `datalad --version`)? On what operating system (consider running `datalad wtf`)?

Is there anything else that would be useful to know in this context?

Have you had any success using DataLad before? (to assess your expertise/prior luck. We would welcome your testimonial additions to https://github.com/datalad/datalad/wiki/Testimonials as well)

Get the DATS model extractor from CONP integrated

@jbpoline knows where it is and here are some datasets with such metadata: https://github.com/CONP-PCNO/conp-dataset/tree/master/projects

schema.org approach to representing quantitative measures

https://github.com/schemaorg/schemaorg/wiki/Using-UN-CEFACT-Codes

Helper to optimize JSON-LD metadata report for Google dataset search ingestion

It looks as if we can already get quite far in terms of getting schema.org metadata out of DataLad and into Google's dataset search. Now we need to make it more convenient to get practical benefits out of this. Few pointers to guide next steps:

demo dataset that is built from two RSS feeds: https://github.com/datalad-datasets/longnow-podcasts
small script to turn the RSS feed into schema.org metadata, structured for access via the custom metadata extractor: https://github.com/datalad-datasets/longnow-podcasts/blob/master/.datalad/maint/update_metadata_from_feed.py
helper to format a README.md based on DataLad's aggregated metadata. This is trivial to mark up with embedded microdata. Either following the example of @chrisgorgo (googlecreativelab/quickdraw-dataset#32), or with a non-table-like inline markup (PoC). Both work equally well.

Beyond this custom approach that would need adjustments and coding for each dataset, a more generic solution would be preferable. The output of the generic call:

datalad --format json meta-dump --reporton jsonld | jq '.metadata'

leaves the Google testing tool in pain. But the majority of the issues seem fixable:

undefined term contentbytesize (unclear if there is a better existing one)
Google doesn't like hasContributor as a Dataset property
a DigitalDocument cannot have a distribution. But for DataLad a plain URL isn't enough, so maybe we have to strip that from an export to make Google happy
Google does not recognize PROV-type agent (or definition is missing, needs check)

add annex sizes information to `annex` metadata

Could be quite useful to know how big the dataset is without installing it.

We could easily include output of the git annex info call, such as (removing remotes here):

$> git annex info --json --bytes | jq .
{
  "local annex size": "12795203",
  "size of annexed files in working tree": "39538339451",
...
  "backend usage": {
    "SHA256E": 1445
  },
  "local annex keys": 3,
  "available local disk space": "101327969728 (+1000000 reserved)",
  "annexed files in working tree": 1445,
}

Most frequently of interest is the size of the dataset and all of its subdatasts, so we should aggregate that information, either during metadata aggregation, or "dynamically" somehow since all that information about data sizes in the subdatasets would be available
This could be of relevance to https://github.com/datalad/datalad/issues/2403 . ATM web ui does similar sizes extraction, but does it also per each file/directory. If we maintain size information also per each file (for annexed - typically could be extracted from key which we already carry; for git - we would need it anyways) so it could be used to estimate also directories sizes "on the fly" (unless we eventually start providing metadata at directory level, which is feasible in many cases, such as subject info in BIDS per sub- directory)

metadata parsers autodetection

What is the problem?

Just continuing datalad/datalad#2079 (comment) . What I need/want is not a mode of operation which would be "autodetect" which we had before, but rather a helper to figure out which of the metadata parsers I could and possibly should enable for any particular dataset. Ideally, if I could run datalad aggregate-metadata --guess-parsers-only which would

report per parser a number of entities (files?) which it detected within dataset to bear some metadata it could parse
augment .datalad/config with those detected parsers it detected to be useful for this particular dataset.

Then I could inspect above output and git diff, agree/change and save. I could also, in the future as more parsers get introduced, to rerun the command and see if any new parsers found to be possibly useful for this dataset.

This would also address a usability issue that parsers are not really exposed well to the user, so it is impossible to know (at a level or user API) which parsers are available and which ones are applicable to this particular dataset.

RF rev_resolve_path -> resolve_path

Renamed in -core. datalad/datalad#3797

aggregate metadata without including it in the superdataset

Use case of HCP: it has thousands of .datalad/metadata/objects. Including HCP dataset into another superdataset and aggregating it metadata would duplicate them in that superdataset, requiring another thousand of files in the index, which would impact clone etc time, but wouldn't be used until metadata is actually used (e.g. for search).
I think a more scaleable setup would be

allow to specify (in .gitmodules) that for a given subdataset, refer to its aggregated metadata, instead of duplicating it
aggregate_v*.json could just refer to .datalad/metadata/objects of such subdatasets. If it is simply a ../../<subdataset>/.datalad/metadata prefix added, then datalad get machinery would be able to get them (installing <subdataset> if needed) and accomplish desired action. But that might be too inflexible (?), so may be allowing for some annotation that metadata should come from a specific subdataset (e.g. specified by additional info_source field containing submodule path) might be desired.

Please cut a release

So released version is compatible with development datalad

Metadata identifier concept

After an insightful discussion with @tgbugs thx!

Dataset (state)

identify a dataset uniquely
@id is the git commit of the dataset
identifier is dataset ID

Subdataset

identify a particular dataset as a subdataset at specific location and version
@id is a hash of dataset id + relative path + subdataset commit

File

identify a file with specific location and content in a particular dataset
@id is the hash of dataset ID (not Git commit!) + relative path within the dataset + contentid

Content

identify a particular file content, regardless of location context
@id is a git-annex key (or something like it for files in Git)
if the file extension is problematic, users must not use the *E backends

CustomMetadataExtractor.get_state() should expose config for ignoring keys from unique-value generator

Otherwise it is impossible to fine-tune this without hacking the code. All it needs is to read a config variable with a list of to-be-ignored keys, and report it as unique_exclude in the state. See:

https://github.com/datalad/datalad-metalad/blob/master/datalad_metalad/aggregate.py#L1142

Need config setting to prevent aggregation from certain subdatasets

What is the problem?

If we have a subdatasets with sensitive information (for internal use) of a public dataset, we want to be able to explicitly disable metadata aggregation and not rely on runtime logic, such as not using --recursive or not having this subdataset installed.

Aggregate metadata of non-dataset repositories

Any datalad dataset should have an ID, but some datasets on datasets.datalad.org have none -- at least according to the metadata in the superdataset.

[ERROR] skipped dataset record without an ID: /tmp/dl/devel/travis-buildlogs 
[ERROR] skipped dataset record without an ID: /tmp/dl/labs/gobbini/famface/data/scripts/mridefacer 
[ERROR] skipped dataset record without an ID: /tmp/dl/neurovault/snapshots

I think this is an issue. Not that such repositories exist (this is just fine), but that we carry metadata about them that cannot be properly associated with a dataset (just with a free floating commit).

Not sure how to deal with that. Probably refuse to aggregate from those (maybe unless forced).

--incremental -- should not be needed to update a single subdataset metadata

What is the problem?

originally mentioned while recrawling datasets - without incremental=True, upon

aggregate_metadata(
                    dataset='^',
                    path=self.repo.path,
                    update_mode='all')

all "neighboring" subdatasets in the super datasets woul loose their metadata. So I've added incremental=True (see datalad/datalad-crawler#10) but that leads to another side-effect -- datalad then doesn't remove those metadata object files which get replaced with the new ones:

commit 4f1a2bfab5406f74ba79943ca88bb4f5c324d8ec (HEAD -> master, tag: 1.0.1)
Author: DataLad Tester <[email protected]>
Date:   Thu Aug 16 12:16:48 2018 -0400

    [DATALAD] Dataset aggregate metadata update

 .datalad/metadata/aggregate_v1.json                               |   6 +++---
 .datalad/metadata/objects/b2/cn-71b6e379ea7b8a6f74d54e6718f448.xz | Bin 0 -> 32 bytes
 .datalad/metadata/objects/b2/ds-71b6e379ea7b8a6f74d54e6718f448    |  14 ++++++++++++++
 3 files changed, 17 insertions(+), 3 deletions(-)

commit 15aa34542c780c7a73e3c6ba8735af1961154c53
Merge: a47dd82 a7f04af
Author: DataLad Tester <[email protected]>
Date:   Thu Aug 16 12:16:47 2018 -0400

    Merge branch 'incoming-processed'

which causes tests to fail since we expected the "pruning" commit first (shouldn't ideally two of those be just a single commit?).

So the question to @mih

do you see a real need for having incremental mode of operation for that aggregate-metadata call?
nothing in --incremental docs mentions keeping previous objects for the dataset.

I wonder also

if pruning of "old" metadata files could be done within the same commit as an update?

Metadata API and implementation needs cleanup

Going through the code things appear to be messy. Lots of duplication, code to access metadata is intermingled with code store aggregated metadata. I plan to go through this part of the code base and clean thing up a bit (likely with functional implications, AFAICS now).

Issues:

More clearly separate read from write access, with the long-term aim to support read of more than the current/default aggregate layout version. ATM this is not needed, we only have one, but we better have the design reflect this future need. OTOH I don't see ourselves creating aggregates in more flavors than whatever we consider best at any given time.
RF to reduce code duplication
(to be continued)

metadata fields extra information or adapters on per-dataset basis?

What is the problem?

Some metadata extractors, such as annex would have no clue about fields data types. I wondered if there could be a way to provide some, possibly ad-hoc, supplemental information (e.g. type etc) and/or data adapters (e.g. again just int, float etc) for extracted metadata fields. May be that is already possible somehow?

Related: [X TO Y] search datalad/datalad#2347 - it relies (I guess) on knowing the type from the metadata values. So aggregated metadata should already be of the right type when it gets there

Fixup datalad's metadata schema to become proper JSON-LD

Following a recommendation by @nicholsn it seems most appropriate to follow the model of: https://schema.org/version/3.4/schema.jsonld

Once done, we also need to document a recommendation how metadata extractor authors should compose their contexts.

Add additional configuration for metadata extractors

What is the problem?

The current paradigm is to run any enabled metadata extractor on any file for content metadata extraction. With the number of extractors continuously increasing, this approach will become an issue. For example, in case of datalad/datalad#2254 this means invoking the singularity command once per each file in a dataset.... Hence we need better configurability -- and we should do it on a general level, so we do not have the complexity of extractors explode. Here is what I can think of right now:

Proposal

Add a config variable per extractor (with a sane default) that contains a regex that is matched against the relative path of a file in the dataset. Only matching paths are sent to an extractor.

This approach will enable easy preconfiguration for a wide range of common cases, i.e. filter paths by a given set of extensions.

A dataset can override/amend the default setting.

a YODA dataset has containers in a dedicated location that the singularity extractor could be pointed at
a BIDS dataset has stimuli in a dedicated place that image, and audio could be pointed at
...

Once we have dataset templates datalad/datalad#1462 appropriate pre-configurations for common scenarios could be prepared.

Current windows failures

Currently investigating the Windows failures -- recording them here:

Appveyor failures on master:

FAIL: datalad_metalad.extractors.tests.test_runprov.test_custom_dsmeta

======================================================================
FAIL: datalad_metalad.extractors.tests.test_runprov.test_custom_dsmeta
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\tests\utils.py", line 559, in newfunc
    return t(*(arg + (d,)), **kw)
  File "c:\projects\datalad-metalad\datalad_metalad\extractors\tests\test_runprov.py", line 51, in test_custom_dsmeta
    assert_result_count(res, 2)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\tests\utils.py", line 1335, in assert_result_count
    dumps(results, indent=1, default=lambda x: str(x))))
AssertionError: Got 1 instead of 2 expected results matching {}. Inspected 1 record(s):
[
 {
  "path": "C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\datalad_temp_tree_751ocp6e",
  "action": "meta_extract",
  "refds": "C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\datalad_temp_tree_751ocp6e",
  "metadata": {
   "metalad_runprov": {
    "@graph": [
     {
      "@id": "9868fe2af8918d6cdb7ff42952039ce1422209ce",
      "atTime": "2020-02-13T06:52:51+00:00",
      "rdfs:comment": "[DATALAD RUNCMD] pristine",
      "prov:wasAssociatedWith": {
       "@id": "3f70461cb7bf5c45308533ef6f730f8b"
      },
      "@type": "activity"
     },
     {
      "@id": "3f70461cb7bf5c45308533ef6f730f8b",
      "name": "DataLad Tester",
      "email": "[email protected]",
      "@type": "agent"
     }
    ],
    "@context": "http://openprovenance.org/prov.jsonld"
   }
  },
  "refcommit": "73db20dcb1c0a84a80fa5d7f5ee97adb40abaa40",
  "status": "ok",
  "type": "dataset"
 }
]

FAIL: datalad_metalad.tests.test_aggregate.test_aggregate_removal

======================================================================
FAIL: datalad_metalad.tests.test_aggregate.test_aggregate_removal
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\tests\utils.py", line 559, in newfunc
    return t(*(arg + (d,)), **kw)
  File "c:\projects\datalad-metalad\datalad_metalad\tests\test_aggregate.py", line 449, in test_aggregate_removal
    eq_(_get_contained_objs(base), _get_referenced_objs(base))
AssertionError: {'.datalad\\metadata\\objects\\2c\\796447dab124639242c7cb84d73c9886a7cc82.xz', '.datalad\\metadata\\objects\\22\\334930b0b35e58678ffff3e5332e9d3dacc5e3.xz', '.datalad\\metadata\\objects\\5c\\ef1118477b6bd588b32bcce2f6c649efbfc3c6.xz', '.datalad\\metadata\\objects\\f7\\e353c67d7c03cf80206ed77c0501340fb1166e.xz', '.datalad\\metadata\\objects\\b9\\ece1471887a834d8aeb0f0a6ac6d4497f8bda4.xz', '.datalad\\metadata\\objects\\22\\5e2be622c0a6e7e9cf2a2e29046ccba48888d0.xz'} != {'.datalad/metadata/objects/f7/e353c67d7c03cf80206ed77c0501340fb1166e.xz', '.datalad/metadata/objects/b9/ece1471887a834d8aeb0f0a6ac6d4497f8bda4.xz', '.datalad/metadata/objects/22/5e2be622c0a6e7e9cf2a2e29046ccba48888d0.xz', '.datalad/metadata/objects/2c/796447dab124639242c7cb84d73c9886a7cc82.xz', '.datalad/metadata/objects/22/334930b0b35e58678ffff3e5332e9d3dacc5e3.xz', '.datalad/metadata/objects/5c/ef1118477b6bd588b32bcce2f6c649efbfc3c6.xz'}

FAIL: datalad_metalad.tests.test_aggregate.test_reaggregate_with_unavailable_objects

======================================================================
FAIL: datalad_metalad.tests.test_aggregate.test_reaggregate_with_unavailable_objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\tests\utils.py", line 559, in newfunc
    return t(*(arg + (d,)), **kw)
  File "c:\projects\datalad-metalad\datalad_metalad\tests\test_aggregate.py", line 342, in test_reaggregate_with_unavailable_objects
    eq_(all(base.repo.file_has_content(objs)), True)
AssertionError: False != True

FAIL: datalad_metalad.tests.test_aggregate.test_reaggregate

======================================================================
FAIL: datalad_metalad.tests.test_aggregate.test_reaggregate
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\tests\utils.py", line 732, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "c:\projects\datalad-metalad\datalad_metalad\tests\test_aggregate.py", line 681, in test_reaggregate
    eq_(good_state, ds.repo.get_hexsha())
AssertionError: '71437976ee6d1dd57153af9740b1931dabfa1cdf' != '5bf4e3845f71f78fcedb6e05305423906ce3ce4b'

FAIL: datalad_metalad.tests.test_base.test_get_refcommit

======================================================================
FAIL: datalad_metalad.tests.test_base.test_get_refcommit
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\tests\utils.py", line 732, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "c:\projects\datalad-metalad\datalad_metalad\tests\test_base.py", line 79, in test_get_refcommit
    eq_(get_refcommit(ds), real_change)
AssertionError: '9eb37f392fb62e1d0dfbfb79343c0ed400a7e440' != '7b30fb98babb13a223f7ac21807dfdc459157212'

FAIL: datalad_metalad.tests.test_report.test_query

======================================================================
FAIL: datalad_metalad.tests.test_report.test_query
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\tests\utils.py", line 732, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\tests\utils.py", line 732, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "c:\projects\datalad-metalad\datalad_metalad\tests\test_report.py", line 157, in test_query
    {k: v for k, v in iteritems(query_res[0]) if k not in strip},
AssertionError: {'type': 'dataset', 'status': 'ok', 'metadata': {'@graph': [{'@id': '3f70461cb7bf5c45308533ef6f730f8b', 'name': 'DataLad Tester', 'email': '[email protected]', '@type': 'agent'}, {'version': '0-7-g4d4e3c4', 'hasPart': [{'@id': 'datalad:MD5E-s7--9a0364b9e99bb480dd25e1f0284c8555.dat', 'name': 'file.dat', '@type': 'DigitalDocument'}, {'@id': 'datalad:b9ca047bf60ea7328f637cd2bbac138ffa4a5df0', 'name': 'sub', 'identifier': 'datalad:6a322348-4e2e-11ea-8049-00155dfb566a', '@type': 'Dataset'}], 'dateModified': '2020-02-13T06:59:52+00:00', '@id': '4d4e3c4b18a4a20afee49e0eb1d878dcb439e413', 'hasContributor': {'@id': '3f70461cb7bf5c45308533ef6f730f8b'}, 'contentbytesize': 7, 'identifier': '67e35838-4e2e-11ea-820d-00155dfb566a', 'dateCreated': '2020-02-13T06:59:39+00:00', '@type': 'Dataset'}, {'contentbytesize': 7, '@id': 'datalad:MD5E-s7--9a0364b9e99bb480dd25e1f0284c8555.dat', 'name': 'file.dat'}], '@context': {'datalad': 'http://dx.datalad.org/', '@vocab': 'http://schema.org/'}}, 'refcommit': '4d4e3c4b18a4a20afee49e0eb1d878dcb439e413'} != {'type': 'dataset', 'status': 'ok', 'metadata': {'@graph': [{'@id': '3f70461cb7bf5c45308533ef6f730f8b', 'name': 'DataLad Tester', 'email': '[email protected]', '@type': 'agent'}, {'version': '0-6-g8cf1904', 'hasPart': [{'@id': 'datalad:MD5E-s7--9a0364b9e99bb480dd25e1f0284c8555.dat', 'name': 'file.dat', '@type': 'DigitalDocument'}, {'@id': 'datalad:b9ca047bf60ea7328f637cd2bbac138ffa4a5df0', 'name': 'sub', 'identifier': 'datalad:6a322348-4e2e-11ea-8049-00155dfb566a', '@type': 'Dataset'}], 'dateModified': '2020-02-13T06:59:48+00:00', '@id': '8cf1904e8ca46f472e27bc7351c702f88b8ed49e', 'hasContributor': {'@id': '3f70461cb7bf5c45308533ef6f730f8b'}, 'contentbytesize': 7, 'identifier': '67e35838-4e2e-11ea-820d-00155dfb566a', 'dateCreated': '2020-02-13T06:59:39+00:00', '@type': 'Dataset'}, {'contentbytesize': 7, '@id': 'datalad:MD5E-s7--9a0364b9e99bb480dd25e1f0284c8555.dat', 'name': 'file.dat'}], '@context': {'datalad': 'http://dx.datalad.org/', '@vocab': 'http://schema.org/'}}, 'refcommit': '8cf1904e8ca46f472e27bc7351c702f88b8ed49e'}

These failures are identical on our Windows test box.

Appveyor failures on #34

FAIL: datalad_metalad.tests.test_aggregate.test_aggregate_into_top_no_extraction

======================================================================
FAIL: datalad_metalad.tests.test_aggregate.test_aggregate_into_top_no_extraction
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\tests\utils.py", line 732, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "c:\projects\datalad-metalad\datalad_metalad\tests\test_aggregate.py", line 889, in test_aggregate_into_top_no_extraction
    type='dataset'
  File "C:\Miniconda35\envs\test-environment\lib\site-packages\datalad\tests\utils.py", line 1335, in assert_result_count
    dumps(results, indent=1, default=lambda x: str(x))))
AssertionError: Got 0 instead of 1 expected results matching {'action': 'meta_extract', 'status': 'notneeded', 'type': 'dataset'}. Inspected 4 record(s):
[
 {
  "action": "add",
  "type": "file",
  "path": "C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\datalad_temp_lqek3pbt\\.datalad\\metadata\\aggregate_v1.json",
  "refds": "C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\datalad_temp_lqek3pbt",
  "status": "ok",
  "message": "",
  "key": null
 },
 {
  "action": "add",
  "type": "file",
  "path": "C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\datalad_temp_lqek3pbt\\.datalad\\metadata\\objects\\3e\\9759429956d9136b87cf62f4e1191104026af3.xz",
  "refds": "C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\datalad_temp_lqek3pbt",
  "status": "ok",
  "message": "",
  "key": "MD5E-s460--e1f9a1a624123fc591723b73a3f5944d.xz"
 },
 {
  "type": "dataset",
  "status": "ok",
  "path": "C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\datalad_temp_lqek3pbt",
  "refds": "C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\datalad_temp_lqek3pbt",
  "action": "save"
 },
 {
  "type": "dataset",
  "status": "ok",
  "path": "C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\datalad_temp_lqek3pbt",
  "action": "meta_aggregate"
 }
]

OPT: prevent runtime penalty of effectively duplicate status() call

During metadata aggregation, rev-aggregate-metadata will call status() to determine if a dataset is modified , and to discover all to-be-processed subdatasets. Subsequently, rev-extract-metadata will call status() again to get list of content items.

As of now, the output of the second status() call is what any extractor is operating on, hence each information bit that is stripped from it, will not be accessbile to an extractor, and in turn might lead to additional queries.

How do we deal with sensitive meta data

There could be some meta data which is sensitive (eg exam date) which we could potentially aggregate.
Is there a way to aggregate that data as well somehow, without leaking such sensitive information??
One possible solution is to "annotate" (within dataset level meta?) what fields could potentially be extracted from the data files on a client side, iff they have access to those files. And then, update of the index on client side would pick them up and incorporate into the index

meta-extract with --format jsonld fails sometimes

I observed a few crashes while playing around with meta_extract in this dataset and trying to use --format jsonld .

A meta-extraction using metalad_core to --format jsonld failed with an AttributeError after metadata extraction

[INFO] Extracted core metadata from /home/adina/repos/multimatch_forrest 
[INFO] Finished core metadata extraction from <Dataset path=/home/adina/repos/multimatch_forrest> 
[INFO] Finished metadata extraction from <Dataset path=/home/adina/repos/multimatch_forrest> 
[ERROR] 'list' object has no attribute 'keys' [extract.py:custom_result_renderer:428] (AttributeError)

It happens when I don't specify -f, but it does not happen if I specify formatting to pretty-printed JSON with -f json_pp.
That is, this command works fine:

datalad -f json_pp meta-extract -d . --source metalad_runprov --source metalad_core --format jsonld

While this command produces the above Error:

datalad meta-extract -d . --source metalad_runprov --source metalad_core --format jsonld

Furthermore, these commands (not using metalad_core anymore, but also formatting to jsonld, with or without pretty-printed JSON)

datalad meta-extract -d . --source metalad_runprov --format jsonld

datalad -f json_pp meta-extract -d . --source metalad_runprov --format jsonld

also throw an error:

[ERROR  ] 'metalad_core' [__init__.py:collect_jsonld_metadata:239] (KeyError)

but without --format jsonld it works fine:

datalad -f json_pp meta-extract -d . --source metalad_runprov

Consider not using interpretable, non-hash @id values

To discourage people from interpreting them. Tried it out, and it is indeed better to not use human-readable IDs... for humans, paradoxically...

meta-aggregate: Unnecessary forced extraction

Discovered via psychoinformatics-de/datalad-hirni#117

I have a subdataset that already has up-to-date metadata aggregated in itself. The superdataset has no metadata yet whatsoever. Now, datalad meta-aggregate sub/ --into top on the superdataset, leads to forced re-extraction of metadata in the subdataset, instead of just getting the already available metadata.

Working on providing a test (and hopefully a fix).

Metadata parser for XMP sidecar files

This way we could read images tags (faces, person names, ...) assigned in tools like digikam, and others. It could be used in combination with the exif parser. But is useful also on its own. XMP metadata can be found in images, videos, but also PDFs.

Here is the python package to do that: https://python-xmp-toolkit.readthedocs.io

Turn metadata extractors into commands?

I started RF'ing the metadata code base. I increasingly dislike the special status of the extractors. They are essentially generators that yield JSON-serializable records -- just like any other command. Why not make them regular commands?

We would just need to define a minimal API that any extractor has to be compliant with.

Version each extractor and include extractor's version within extracted metadata

I see it as the only way to automagically figure out when re-extraction is necessary. Otherwise we only have --force-extraction option which is non-specific, thus would reextract for all extractors which might take time, and require data access

Regain compatibility with datasets using adjusted branches

Here is the story: datalad/datalad#3817

The old metadata code is affected, unsure if we have carried that burden over...

core extractor should report per file commit hexsha of last modification

That should make it easier to do incremental metadata updates and be able to verify if all files metadata is up to date or some is behind (eg if some extractor wasn't available at that point)

meta-metadata extractor: NLP

an idea brought up by the reviewer 1 in the recent NSF grant proposal -- NLP metadata extractor for not clearly structured metadata such as clinical annotations. Citing: "NLP will allow for automated extraction and grouping of metadata terms, which is crucial for accurate querying. Otherwise, the query is likely to miss key datasets due to differences in metadata wording."

It feels that it is best fit into some extractor which would consume extracted metadata, but could be an extractor on its own OR utility to reuse/enable for some metadata fields during exctraction

Webcatalog landing page should conform to datacite schema

https://schema.datacite.org/

and maybe opengraph metadata too?

Add a metada extractor for .zenodo.json

550+ projects on github have this kind of metadata.

metamovie metadata extractor?

see https://github.com/pigmonkey/metamovie

Modernization of metadata handling (excluding search)

Plan

I am not aware of any need to change the format and structure of the current metadata aggregates (apart from datalad/datalad#3105)
Three separate components for metadata extraction, aggregation and access (not search) datalad/datalad#3134 (note: was closed without merging)

Extraction

Aggregation

Access

ensure to always report the best available metadata https://github.com/datalad/datalad/issues/3055
report hasPart/isPartOf properties for datasets and files respectively on access (implicitely defined in the internal storage structure, hence no need to pollute the extracted metadata with it

Map dataset properties to schema.org terms

I want to improve the usability of the output of our internal metadata extractors -- those that look at common properties of all datasets (I consider anything else out-of-scope here), and therefore can always run wrt to applications like datalad/datalad-revolution#76

For this, it would be helpful to discuss and agree on a mapping of such properties on schema.org terms. The following is a list of terms that (I think) are applicable, and their proposed mapping.
Please contribute by extending the list, and arguing for/against my proposal. Thx!

After a bit of thinking, I am of the opinion that we should avoid shoehorning meaning onto terms that exist, but are not a 100% match. So the following is a list of only those properties, where I see such amatch.

Standard, with a definable source

The following aspects have a (potentially definitive) source within the scope of the core metadata extractors.

`identifier` [recommended by Google]

The identifier property represents any kind of identifier for any kind of Thing, such as ISBNs, GTIN codes, UUIDs etc.
https://schema.org/identifier (PropertyValue | Text | URL)

This is a dataset's UUID. This ID is also used to identify relationships between datasets.

See the hasPart section for a potential reason to also consider the latest commit SHA as an additional identifier.

`contributor`

A secondary contributor to the CreativeWork or Event.
https://schema.org/contributor (Organization | Person)

We have no way to infer author, but we can surely state that any author of a commit in the history is a contributor. An extractor should consult the mailmap to give a sensible report. The list of contributors reported by the extractor need not be exhaustive.

Something like this: git log --use-mailmap --no-merges --format=format:'%aN%x00%aE%n%cN%x00%cE' |sort |uniq

`hasPart`

Indicates an item or CreativeWork that is part of this item, or CreativeWork (in some sense)
https://schema.org/hasPart (CreativeWork)

These are primarily subdatasets (referenced by their dataset ID), but we could provide a list of files (by annex-key or shasum too).

This one is tricky to assemble as a given dataset may not have all information about subdatasets (e.g. their IDs aggregated, and we cannot rely on or require all subdatasets to be installed. What we do know, however, is the state of the subdataset that is referenced (commit) -- this is as much of a precise ID then the UUID, but much more volatile.

`isPartOf`

Indicates an item or CreativeWork that this item, or CreativeWork (in some sense), is part of.
https://schema.org/isPartOf (CreativeWork)

This is the UUID of the superdataset. BUT see includedInDataCatalog. For any given dataset, this is easier to determine that the hasPart side of thing. We just need to look for a single superdataset once vs. all the subdatasets.

`distribution`

A downloadable form of this dataset, at a specific location, in a specific format.
https://schema.org/distribution (DataDownload)

This is any remote of a dataset, described by a compound object (see below for applicable properties).

`distribution.contentUrl`

Actual bytes of the media object
https://schema.org/contentUrl (URL)

A URL that datalad install can act on. The media object here is the dataset itself.

`dateCreated`

The date on which the CreativeWork was created or the item was added to a DataFeed.
https://schema.org/dateCreated (Date | DateTime)

Timestamp of the initial commit.

`(distribution.)dateModified`

The date on which the CreativeWork was most recently modified or when the item's entry was modified within a DataFeed.
https://schema.org/dateModified (Date | DateTime)

The timestamp of the last commit on record for the dataset, or, in case of a DataDownload, the respective remote.

`distribution.uploadDate`

Date when this media object was uploaded to this site.
https://schema.org/uploadDate (Date)

We cannot easily say this, unless we make "publish" leave a trace.

`distribution.name`

The name of the item
https://schema.org/name (Text)

Name of a remote. Not sure if special remotes qualify, as we need to identify distributions of the dataset,
not (parts) of its content.

`distribution.url`

URL of the item
https://schema.org/url (URL)

The (fetch) URL of a remote.

`distribution.description`

A description of the item.
https://schema.org/description (Text)

A short description of the nature of the remote (Git,or git-annex special-remote of type ...)

`distribution.identifier`

The identifier property represents any kind of identifier for any kind of Thing, such as ISBNs, GTIN codes, UUIDs etc.
https://schema.org/identifier (PropertyValue | Text | URL)

This UUID of a git-annex key-store for a remote (if any exists).

`provider`

The service provider, service operator, or service performer; the goods producer. Another party (a seller) may offer those services or goods on behalf of the provider. A provider may also serve as the seller.
https://schema.org/provider (Organization | Person)

This could be a record for DataLad itself, identifying it as the service provider (in the scope of the Dataset record; for DataDownload it would be a data portal or something else, but this is impossible to infer in general ).

There is also publisher, but I don't think that matches the role of DataLad.

`version` [recommended by Google]

The version of the CreativeWork embodied by a specified resource.
https://schema.org/version (Number | Text)

0-<ncommits>-<refcommit-shasum>

Poor man's alternative to git describe -- which we cannot use unconditionally, as it needs (annotated) tags to function. Instead, we count any commit and use the initial commit as a universal reference. Above format mimics git describe output, but uses 0 as a constant prefix (not a tag).

ncommits = git log --no-merges --format=oneline |wc -l

`includedInDataCatalog` [recommended by Google]

A data catalog which contains this dataset.
https://schema.org/includedInDataCatalog (DataDownload)

List of remotes (i.e. distributions, see above), plus the super dataset (topmost only, to not by redundant with isPartOf).

Not sure how to reference the superdataset as a DataCatalog-type. For DataLad any Dataset is also a DataCatalog. Maybe we should market any dataset as a catalog, but we would loose the distribution field when switching the type (and not sure of things like Google dataset search are happy with this).

One approach to deal with this in a context like datalad/datalad-revolution#76 (where we know that a superdataset is serving the purpose of a data catalog, as opposed to just tracking dependencies) would be to generate a single page with a DataCatalog-type metadata record, and for all subdataset pages refer to this page as the containing catalog. In DataLad's actual metadata records, however, we do not use this (as the "is-in" relationship is only reliably determined from a (distant) superdataset). Instead, we limit the record to immediate child relationships, i.e. hasPart, and only inject isPartOf and includedInDataCatalog at the time of exporting metadata in a specific context, for a specific purpose.

Should be standard, but have no standard source

For the following aspects we could implement heuristics. Extracted metadata should only contain facts. Any such heuristics should be employed/executed at a late stage in an application context (where it is known how error-tolerant one can be). Hence, we only have to think about, what kind of factual information we want to extract to enable such heuristics.

`license`

A license document that applies to this content, typically indicated by URL.
https://schema.org/license (CreativeWork | URL)

We could extract the content of LICENSE or COPYING, if such file exists.

`keywords`

Keywords or tags used to describe this content. Multiple entries in a keywords list are typically delimited by commas.
https://schema.org/keywords (Text)

Amend with the names of metadata extractors.

meta-aggregate must not behave differently, with or without --dataset

Setup:

(hirni-dev) ben@tree:/tmp$ datalad create super
[INFO   ] Creating a new annex repo at /tmp/super 
create(ok): /tmp/super (dataset)                                                                                                                                                                                                                              
(hirni-dev) ben@tree:/tmp$ cd super
(hirni-dev) ben@tree:/tmp/super$ datalad install -d . -s https://github.com/datalad/example-dicom-functional
[INFO   ] Cloning https://github.com/datalad/example-dicom-functional [1 other candidates] into '/tmp/super/example-dicom-functional' 
install(ok): example-dicom-functional (dataset)                                                                                                                                                                                                               
action summary:
  add (ok: 2)
  install (ok: 1)
  save (ok: 1)
(hirni-dev) ben@tree:/tmp/super$ git status
On branch master
nothing to commit, working tree clean
(hirni-dev) ben@tree:/tmp/super$ ll
total 48
drwxr-xr-x  5 ben  ben   4096 May 17 18:49 .
drwxrwxrwt 49 root root 20480 May 17 18:49 ..
drwxr-xr-x  2 ben  ben   4096 May 17 18:30 .datalad
drwxr-xr-x  5 ben  ben   4096 May 17 18:49 example-dicom-functional
drwxr-xr-x  9 ben  ben   4096 May 17 18:50 .git
-rw-r--r--  1 ben  ben     55 May 17 18:30 .gitattributes
-rw-r--r--  1 ben  ben    182 May 17 18:49 .gitmodules

Now, I want metadata:

(hirni-dev) ben@tree:/tmp/super$ datalad meta-aggregate example-dicom-functional/ --into top
add(ok): .datalad/metadata/aggregate_v1.json (file)                                                                                                                                                                                                           
add(ok): .datalad/metadata/objects/76/16940cda42de07bf523b83749c11a308ed1f33.xz (file)                                                                                                                                                                        
add(ok): .datalad/metadata/objects/93/5a7a3d5923f6a904416fa1dc5ef68f59757323.xz (file)                                                                                                                                                                        
save(ok): . (dataset)
meta_aggregate(ok): /tmp/super (dataset)
action summary:
  add (ok: 3)
  meta_aggregate (ok: 1)
  save (ok: 1)
(hirni-dev) ben@tree:/tmp/super$ datalad meta-dump example-dicom-functional --reporton datasets
/tmp/super/example-dicom-functional (dataset): datalad_unique_content_properties,dicom,metalad_core

Looks good. Do it again by force to see the same thing would be the result:

(hirni-dev) ben@tree:/tmp/super$ datalad meta-aggregate example-dicom-functional --into top --force fromscratch
add(ok): .datalad/metadata/aggregate_v1.json (file)                                                                                                                                                                                                           
add(ok): .datalad/metadata/objects/76/16940cda42de07bf523b83749c11a308ed1f33.xz (file)                                                                                                                                                                        
add(ok): .datalad/metadata/objects/93/5a7a3d5923f6a904416fa1dc5ef68f59757323.xz (file)                                                                                                                                                                        
save(ok): . (dataset)
meta_aggregate(ok): /tmp/super (dataset)
action summary:
  add (ok: 3)
  meta_aggregate (ok: 1)
  save (ok: 1)
(hirni-dev) ben@tree:/tmp/super$ datalad meta-dump example-dicom-functional --reporton datasets
/tmp/super/example-dicom-functional (dataset): datalad_unique_content_properties,dicom,metalad_core

Looks also just fine.
Now do the very same thing via python:

(hirni-dev) ben@tree:/tmp/super$ python -c "from datalad.api import Dataset; Dataset('.').meta_dump('example-dicom-functional', reporton='datasets')"
example-dicom-functional (dataset): datalad_unique_content_properties,dicom,metalad_core
(hirni-dev) ben@tree:/tmp/super$ python -c "from datalad.api import Dataset; Dataset('.').meta_aggregate('example-dicom-functional', into='top', force='fromscratch')"
(hirni-dev) ben@tree:/tmp/super$ datalad meta-dump example-dicom-functional --reporton datasets
/tmp/super (dataset): metalad_core

WTF? Where is the DICOM metadata?
While the cmdline call triggers the DICOM extractor, the python call doesn't for some reason.
ATM I have no idea how the call, the command itself or my environment is messed up.

# WTF
- path: /tmp/super
- type: dataset
## configuration <SENSITIVE, report disabled by configuration>
## datalad 
  - full_version: 0.12.0rc4.dev5-gb59b
  - version: 0.12.0rc4.dev5
## dataset 
  - metadata: <SENSITIVE, report disabled by configuration>
  - path: /tmp/super
  - repo: AnnexRepo
## dependencies 
  - appdirs: 1.4.3
  - boto: 2.49.0
  - cmd:annex: 7.20190129
  - cmd:git: 2.20.1
  - cmd:system-git: 2.20.1
  - cmd:system-ssh: 7.9p1
  - exifread: 2.1.2
  - git: 2.1.8
  - gitdb: 2.0.5
  - humanize: 0.5.1
  - iso8601: 0.1.12
  - keyring: 19.0.1
  - keyrings.alt: 3.1.1
  - msgpack: 0.6.1
  - mutagen: 1.42.0
  - requests: 2.21.0
  - six: 1.12.0
  - tqdm: 4.31.1
  - wrapt: 1.11.1
## environment 
  - GIT_PYTHON_GIT_EXECUTABLE: /usr/bin/git
  - LANG: en_US.UTF-8
  - LANGUAGE: en_US:en
  - PATH: /home/ben/venvs/hirni-dev/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
## extentions 
  - container: 
    - description: Containerized environments
    - entrypoints: 
      - datalad_container.containers_add.ContainersAdd: 
        - class: ContainersAdd
        - load_error: None
        - module: datalad_container.containers_add
        - names: 
          - containers-add
          - containers_add
      - datalad_container.containers_list.ContainersList: 
        - class: ContainersList
        - load_error: None
        - module: datalad_container.containers_list
        - names: 
          - containers-list
          - containers_list
      - datalad_container.containers_remove.ContainersRemove: 
        - class: ContainersRemove
        - load_error: None
        - module: datalad_container.containers_remove
        - names: 
          - containers-remove
          - containers_remove
      - datalad_container.containers_run.ContainersRun: 
        - class: ContainersRun
        - load_error: None
        - module: datalad_container.containers_run
        - names: 
          - containers-run
          - containers_run
    - load_error: None
    - module: datalad_container
    - version: 0.3.1
  - hirni: 
    - description: HIRNI workflows
    - entrypoints: 
      - datalad_hirni.commands.dicom2spec.Dicom2Spec: 
        - class: Dicom2Spec
        - load_error: None
        - module: datalad_hirni.commands.dicom2spec
        - names: 
          - hirni-dicom2spec
          - hirni_dicom2spec
      - datalad_hirni.commands.import_dicoms.ImportDicoms: 
        - class: ImportDicoms
        - load_error: None
        - module: datalad_hirni.commands.import_dicoms
        - names: 
          - hirni-import-dcm
          - hirni_import_dcm
      - datalad_hirni.commands.spec2bids.Spec2Bids: 
        - class: Spec2Bids
        - load_error: None
        - module: datalad_hirni.commands.spec2bids
        - names: 
          - hirni-spec2bids
          - hirni_spec2bids
      - datalad_hirni.commands.spec4anything.Spec4Anything: 
        - class: Spec4Anything
        - load_error: None
        - module: datalad_hirni.commands.spec4anything
        - names: 
          - hirni-spec4anything
          - hirni_spec4anything
    - load_error: None
    - module: datalad_hirni
    - version: None
  - metalad: 
    - description: DataLad semantic metadata command suite
    - entrypoints: 
      - datalad_metalad.aggregate.Aggregate: 
        - class: Aggregate
        - load_error: None
        - module: datalad_metalad.aggregate
        - names: 
          - meta-aggregate
          - meta_aggregate
      - datalad_metalad.dump.Dump: 
        - class: Dump
        - load_error: None
        - module: datalad_metalad.dump
        - names: 
          - meta-dump
          - meta_dump
      - datalad_metalad.extract.Extract: 
        - class: Extract
        - load_error: None
        - module: datalad_metalad.extract
        - names: 
          - meta-extract
          - meta_extract
    - load_error: None
    - module: datalad_metalad
    - version: 0.1.0
  - neuroimaging: 
    - description: Neuroimaging tools
    - entrypoints: 
      - datalad_neuroimaging.bids2scidata.BIDS2Scidata: 
        - class: BIDS2Scidata
        - load_error: None
        - module: datalad_neuroimaging.bids2scidata
        - names: 
          - bids2scidata
    - load_error: None
    - module: datalad_neuroimaging
    - version: 0.2.1
  - revolution: 
    - description: DataLad revolutionary command suite
    - entrypoints: 
      - datalad_revolution.metadata.query.QueryMetadata: 
        - class: QueryMetadata
        - load_error: None
        - module: datalad_revolution.metadata.query
        - names: 
          - query-metadata
          - query_metadata
      - datalad_revolution.metadata.revaggregate.RevAggregateMetadata: 
        - class: RevAggregateMetadata
        - load_error: None
        - module: datalad_revolution.metadata.revaggregate
        - names: 
          - rev-aggregate-metadata
          - rev_aggregate_metadata
      - datalad_revolution.metadata.revextract.RevExtractMetadata: 
        - class: RevExtractMetadata
        - load_error: None
        - module: datalad_revolution.metadata.revextract
        - names: 
          - rev-extract-metadata
          - rev_extract_metadata
      - datalad_revolution.revcreate.RevCreate: 
        - class: RevCreate
        - load_error: None
        - module: datalad_revolution.revcreate
        - names: 
          - rev-create
          - rev_create
      - datalad_revolution.revdiff.RevDiff: 
        - class: RevDiff
        - load_error: None
        - module: datalad_revolution.revdiff
        - names: 
          - rev-diff
          - rev_diff
      - datalad_revolution.revstatus.RevStatus: 
        - class: RevStatus
        - load_error: None
        - module: datalad_revolution.revstatus
        - names: 
          - rev-status
          - rev_status
    - load_error: None
    - module: datalad_revolution
    - version: 0.10.0
  - webapp: 
    - description: Generic web app support
    - entrypoints: 
      - datalad_webapp.WebApp: 
        - class: WebApp
        - load_error: None
        - module: datalad_webapp
        - names: 
          - webapp
          - webapp
    - load_error: None
    - module: datalad_webapp
    - version: 0.2
## git-annex 
  - build flags: 
    - Assistant
    - Webapp
    - Pairing
    - S3(multipartupload)(storageclasses)
    - WebDAV
    - Inotify
    - DBus
    - DesktopNotify
    - TorrentParser
    - MagicMime
    - Feeds
    - Testsuite
  - dependency versions: 
    - aws-0.20
    - bloomfilter-2.0.1.0
    - cryptonite-0.25
    - DAV-1.3.3
    - feed-1.0.0.0
    - ghc-8.4.4
    - http-client-0.5.13.1
    - persistent-sqlite-2.8.2
    - torrent-10000.1.1
    - uuid-1.3.13
    - yesod-1.6.0
  - key/value backends: 
    - SHA256E
    - SHA256
    - SHA512E
    - SHA512
    - SHA224E
    - SHA224
    - SHA384E
    - SHA384
    - SHA3_256E
    - SHA3_256
    - SHA3_512E
    - SHA3_512
    - SHA3_224E
    - SHA3_224
    - SHA3_384E
    - SHA3_384
    - SKEIN256E
    - SKEIN256
    - SKEIN512E
    - SKEIN512
    - BLAKE2B256E
    - BLAKE2B256
    - BLAKE2B512E
    - BLAKE2B512
    - BLAKE2B160E
    - BLAKE2B160
    - BLAKE2B224E
    - BLAKE2B224
    - BLAKE2B384E
    - BLAKE2B384
    - BLAKE2S256E
    - BLAKE2S256
    - BLAKE2S160E
    - BLAKE2S160
    - BLAKE2S224E
    - BLAKE2S224
    - BLAKE2SP256E
    - BLAKE2SP256
    - BLAKE2SP224E
    - BLAKE2SP224
    - SHA1E
    - SHA1
    - MD5E
    - MD5
    - WORM
    - URL
  - local repository version: 5
  - operating system: linux x86_64
  - remote types: 
    - git
    - gcrypt
    - p2p
    - S3
    - bup
    - directory
    - rsync
    - web
    - bittorrent
    - webdav
    - adb
    - tahoe
    - glacier
    - ddar
    - hook
    - external
  - supported repository versions: 
    - 5
    - 7
  - upgrade supported from repository versions: 
    - 0
    - 1
    - 2
    - 3
    - 4
    - 5
    - 6
  - version: 7.20190129
## metadata_extractors 
  - annex: 
    - load_error: None
    - module: datalad.metadata.extractors.annex
    - version: None
  - audio: 
    - load_error: None
    - module: datalad.metadata.extractors.audio
    - version: None
  - bids: 
    - load_error: None
    - module: datalad_neuroimaging.extractors.bids
    - version: None
  - custom: 
    - load_error: None
    - module: datalad_revolution.metadata.extractors.custom
    - version: None
  - datacite: 
    - load_error: None
    - module: datalad.metadata.extractors.datacite
    - version: None
  - datalad_core: 
    - load_error: None
    - module: datalad.metadata.extractors.datalad_core
    - version: None
  - datalad_rfc822: 
    - load_error: None
    - module: datalad.metadata.extractors.datalad_rfc822
    - version: None
  - dicom: 
    - load_error: None
    - module: datalad_neuroimaging.extractors.dicom
    - version: None
  - exif: 
    - load_error: None
    - module: datalad.metadata.extractors.exif
    - version: None
  - frictionless_datapackage: 
    - load_error: None
    - module: datalad.metadata.extractors.frictionless_datapackage
    - version: None
  - image: 
    - load_error: None
    - module: datalad.metadata.extractors.image
    - version: None
  - metalad_annex: 
    - load_error: None
    - module: datalad_metalad.extractors.annex
    - version: None
  - metalad_core: 
    - load_error: None
    - module: datalad_metalad.extractors.core
    - version: None
  - metalad_custom: 
    - load_error: None
    - module: datalad_metalad.extractors.custom
    - version: None
  - metalad_runprov: 
    - load_error: None
    - module: datalad_metalad.extractors.runprov
    - version: None
  - nidm: 
    - load_error: None
    - module: datalad_neuroimaging.extractors.nidm
    - version: None
  - nifti1: 
    - load_error: None
    - module: datalad_neuroimaging.extractors.nifti1
    - version: None
  - xmp: 
    - load_error: Exempi library not found. [exempi.py:_load_exempi:60]
    - module: datalad.metadata.extractors.xmp
## python 
  - implementation: CPython
  - version: 3.7.3rc1
## system 
  - distribution: debian/buster/sid
  - encoding: 
    - default: utf-8
    - filesystem: utf-8
    - locale.prefered: UTF-8
  - max_path_length: 266
  - name: Linux
  - release: 4.19.0-4-amd64
  - type: posix
  - version: #1 SMP Debian 4.19.28-2 (2019-03-15)

Superds fails to find subds metadata whenever subds doesn't have it aggregated and is installed

For upcoming ///openneuro I am aggregating metadata into that superdataset from subdatasets (https://github.com/datalad/datalad-crawler/pull/28/files#diff-8e8fc59a503a8bdc5f90e33e16d020b7R146), which seems to work lovely. But then I have discovered that I cannot query metadata within subdataset whenever it is installed:

$> datalad ls ds000008 
ds000008   [annex]  master  ✗ 2018-07-13/21:41:18  ✓

$> datalad -f json_pp metadata --reporton all ds000008/sub-01/anat/sub-01_inplaneT2.nii.gz
[WARNING] Found no aggregated metadata info file /mnt/btrfs/scrap/datalad/openneuro-samples/ds000008/.datalad/metadata/aggregate_v1.json. Found following info files, which might have been generated with newer version(s) of datalad: .datalad/metadata/aggregate_v1.json. You will likely need to either update the dataset from its original location,, upgrade datalad or reaggregate metadata locally. 
[WARNING] Dataset at . contains no aggregated metadata on this path [metadata(/mnt/btrfs/scrap/datalad/openneuro-samples/ds000008/sub-01/anat/sub-01_inplaneT2.nii.gz)] 
{
  "action": "metadata", 
  "path": "/mnt/btrfs/scrap/datalad/openneuro-samples/ds000008/sub-01/anat/sub-01_inplaneT2.nii.gz", 
  "status": "impossible", 
  "type": "file"
}

whenever it works fine as soon as I uninstall it

$> datalad uninstall ds000008 
uninstall(ok): /mnt/btrfs/scrap/datalad/openneuro-samples/ds000008 (dataset)
action summary:
  drop (notneeded: 1)
  uninstall (ok: 1)

$> datalad -f json_pp metadata --reporton all ds000008/sub-01/anat/sub-01_inplaneT2.nii.gz | head -n 10
{
  "action": "metadata", 
  "dsid": "ada6363e-8706-11e8-b2f9-0242ac12001e", 
  "metadata": {
    "@context": {
      "@vocab": "http://docs.datalad.org/schema_v2.0.json"
    }, 
    "annex": {
      "key": "MD5E-s592833--8d37e298f8afe55cfd04a19870399198.nii.gz"
    }, 
...

Metadata extractor availability should be checked before extraction

ATM extraction will error out whenever deep in the process a configured extractor is found to be unavailable. This should be checked upfront and maybe indicated with some flag whether it is OK to proceed nevertheless. This could be combined with the switches for incremental aggregation.

datalad / datalad-metalad Goto Github PK

datalad-metalad's People

Stargazers

Watchers

Forkers

datalad-metalad's Issues

What is the problem?

What is the problem?

What is the problem?

What is the problem?

Additional points which came to mind:

What is the problem?

What steps will reproduce the problem?

What version of DataLad are you using (run datalad --version)? On what operating system (consider running datalad wtf)?

Is there anything else that would be useful to know in this context?

Have you had any success using DataLad before? (to assess your expertise/prior luck. We would welcome your testimonial additions to https://github.com/datalad/datalad/wiki/Testimonials as well)

What is the problem?

Dataset (state)

Subdataset

File

Content

What is the problem?

What is the problem?

What is the problem?

What is the problem?

Proposal

Appveyor failures on master:

Appveyor failures on #34

Plan

Extraction

Aggregation

Access

Standard, with a definable source

identifier [recommended by Google]

contributor

hasPart

isPartOf

distribution

distribution.contentUrl

dateCreated

(distribution.)dateModified

distribution.uploadDate

distribution.name

distribution.url

distribution.description

distribution.identifier

provider

version [recommended by Google]

includedInDataCatalog [recommended by Google]

Should be standard, but have no standard source

license

keywords

Recommend Projects

Recommend Topics

Recommend Org

What version of DataLad are you using (run `datalad --version`)? On what operating system (consider running `datalad wtf`)?

`identifier` [recommended by Google]

`contributor`

`hasPart`

`isPartOf`

`distribution`

`distribution.contentUrl`

`dateCreated`

`(distribution.)dateModified`

`distribution.uploadDate`

`distribution.name`

`distribution.url`

`distribution.description`

`distribution.identifier`

`provider`

`version` [recommended by Google]

`includedInDataCatalog` [recommended by Google]

`license`

`keywords`