ro-crate-py's Introduction

ro-crate-py is a Python library to create and consume Research Object Crates. It currently supports the RO-Crate 1.1 specification.

Installation

ro-crate-py requires Python 3.7 or later. The easiest way to install is via pip:

pip install rocrate

To install manually from this code base (e.g., to try the latest development revision):

git clone https://github.com/ResearchObject/ro-crate-py
cd ro-crate-py
pip install .

Usage

Creating an RO-Crate

In its simplest form, an RO-Crate is a directory tree with an ro-crate-metadata.json file at the top level. This file contains metadata about the other files and directories, represented by data entities. These metadata consist both of properties of the data entities themselves and of other, non-digital entities called contextual entities. A contextual entity can represent, for instance, a person, an organization or an event.

Suppose Alice and Bob worked on a research task together, which resulted in a manuscript written by both; additionally, Alice prepared a spreadsheet containing the experimental data, which Bob used to generate a diagram. We will create placeholder files for these documents:

mkdir exp
touch exp/paper.pdf
touch exp/results.csv
touch exp/diagram.svg

Let's make an RO-Crate to package all this:

from rocrate.rocrate import ROCrate

crate = ROCrate()
paper = crate.add_file("exp/paper.pdf", properties={
    "name": "manuscript",
    "encodingFormat": "application/pdf"
})
table = crate.add_file("exp/results.csv", properties={
    "name": "experimental data",
    "encodingFormat": "text/csv"
})
diagram = crate.add_file("exp/diagram.svg", dest_path="images/figure.svg", properties={
    "name": "bar chart",
    "encodingFormat": "image/svg+xml"
})

We've started by adding the data entities. Now we need contextual entities to represent Alice and Bob:

from rocrate.model.person import Person

alice_id = "https://orcid.org/0000-0000-0000-0000"
bob_id = "https://orcid.org/0000-0000-0000-0001"
alice = crate.add(Person(crate, alice_id, properties={
    "name": "Alice Doe",
    "affiliation": "University of Flatland"
}))
bob = crate.add(Person(crate, bob_id, properties={
    "name": "Bob Doe",
    "affiliation": "University of Flatland"
}))

At this point, we have a representation of the various entities. Now we need to express the relationships between them. This is done by adding properties that reference other entities:

paper["author"] = [alice, bob]
table["author"] = alice
diagram["author"] = bob

You can also add whole directories together with their contents. In an RO-Crate, a directory is represented by the Dataset entity. Create a directory with some placeholder files:

mkdir exp/logs
touch exp/logs/log1.txt
touch exp/logs/log2.txt

Now add it to the crate:

logs = crate.add_dataset("exp/logs")

Finally, we serialize the crate to disk:

crate.write("exp_crate")

Now the exp_crate directory should contain copies of all the files we added and an ro-crate-metadata.json file with a JSON-LD representation of the entities and relationships we created. Note that we have chosen a different destination path for the diagram, while the other two files have been placed at the top level with their names unchanged (the default).

Exploring the exp_crate directory, we see that all files and directories contained in exp/logs have been added recursively to the crate. However, in the ro-crate-metadata.json file, only the top level Dataset with @id "exp/logs" is listed. This is because we used crate.add_dataset("exp/logs") rather than adding every file individually. There is no requirement to represent every file and folder within the crate in the ro-crate-metadata.json file. If you do want to add files and directories recursively to the metadata, use crate.add_tree instead of crate.add_dataset (but note that it only works on local directory trees).
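
To see which files on disk have no entry in the metadata file, you can compare the two directly. The following is a minimal sketch that uses only the standard library; the unlisted_files helper and the throwaway crate it runs on are made up for illustration and are not part of ro-crate-py:

```python
import json
import tempfile
from pathlib import Path


def unlisted_files(crate_dir):
    """Return relative paths of files on disk that have no entry in the metadata file."""
    crate_dir = Path(crate_dir)
    with open(crate_dir / "ro-crate-metadata.json") as f:
        graph = json.load(f)["@graph"]
    listed = {entity["@id"] for entity in graph}
    on_disk = {
        p.relative_to(crate_dir).as_posix()
        for p in crate_dir.rglob("*")
        if p.is_file()
    }
    return sorted(on_disk - listed)


# Build a tiny throwaway crate to try the function on (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "logs").mkdir()
    (root / "logs" / "log1.txt").touch()
    (root / "paper.pdf").touch()
    metadata = {"@graph": [
        {"@id": "./", "@type": "Dataset"},
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork"},
        {"@id": "paper.pdf", "@type": "File"},
        {"@id": "logs/", "@type": "Dataset"},
    ]}
    (root / "ro-crate-metadata.json").write_text(json.dumps(metadata))
    missing = unlisted_files(root)
    print(missing)
```

Here only the logs/ directory is listed, so the file inside it shows up as unlisted even though it is part of the crate's payload.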

Some applications and services support RO-Crates stored as archives. To save the crate in zip format, use write_zip:

crate.write_zip("exp_crate.zip")

Appending elements to property values

What ro-crate-py entities actually store is their JSON representation:

paper.properties()
{
  "@id": "paper.pdf",
  "@type": "File",
  "name": "manuscript",
  "encodingFormat": "application/pdf",
  "author": [
    {"@id": "https://orcid.org/0000-0000-0000-0000"},
    {"@id": "https://orcid.org/0000-0000-0000-0001"}
  ]
}

When paper["author"] is accessed, a new list containing the alice and bob entities is generated on the fly. For this reason, calling append on paper["author"] won't actually modify the paper entity in any way. To add an author, use the append_to method instead:

donald = crate.add(Person(crate, "https://en.wikipedia.org/wiki/Donald_Duck", properties={
  "name": "Donald Duck"
}))
paper.append_to("author", donald)

Note that append_to also works if the property to be updated is missing or has only one value:

for n in "Mickey_Mouse", "Scrooge_McDuck":
    p = crate.add(Person(crate, f"https://en.wikipedia.org/wiki/{n}"))
    donald.append_to("follows", p)
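
The difference between appending to the list returned by item access and calling append_to can be illustrated with a toy model. This is only a sketch of the behavior described above: ToyEntity and its registry are hypothetical and do not reproduce the library's actual classes.

```python
class ToyEntity:
    """Hypothetical stand-in for an entity; not the library's actual class."""

    def __init__(self, identifier, registry):
        self.id = identifier
        self._jsonld = {"@id": identifier}
        self._registry = registry  # maps @id -> ToyEntity
        registry[identifier] = self

    def __setitem__(self, key, values):
        # Store only {"@id": ...} references, as in the JSON shown above.
        self._jsonld[key] = [{"@id": v.id} for v in values]

    def __getitem__(self, key):
        # A fresh list of entities is built on every access.
        return [self._registry[ref["@id"]] for ref in self._jsonld[key]]

    def append_to(self, key, value):
        # Update the stored JSON itself, creating the property if missing.
        self._jsonld.setdefault(key, []).append({"@id": value.id})


registry = {}
paper = ToyEntity("paper.pdf", registry)
alice = ToyEntity("#alice", registry)
bob = ToyEntity("#bob", registry)
paper["author"] = [alice]

paper["author"].append(bob)      # modifies a temporary list only
count_after_append = len(paper["author"])

paper.append_to("author", bob)   # modifies the stored representation
count_after_append_to = len(paper["author"])
```

In this model the first append is lost because it acts on a list that is discarded right away, while append_to changes the stored references.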

Adding remote entities

Data entities can also be remote:

input_data = crate.add_file("http://example.org/exp_data.zip")

By default the file won't be downloaded, and will be referenced by its URI in the serialized crate:

{
  "@id": "http://example.org/exp_data.zip",
  "@type": "File"
},

If you add fetch_remote=True to the add_file call, however, the library (when crate.write is called) will try to download the file and include it in the output crate.

Another option that influences the behavior when dealing with remote entities is validate_url, also False by default: if it's set to True, when the crate is serialized, the library will try to open the URL to add / update metadata bits such as the content's length and format (but it won't try to download the file unless fetch_remote is also set).

Adding entities with an arbitrary type

An entity can be of any type listed in the RO-Crate context. However, only a few of them have a counterpart (e.g., File) in the library's class hierarchy (either because they are very common or because they are associated with specific functionality that can be conveniently embedded in the class implementation). In other cases, you can explicitly pass the type via the properties argument:

from rocrate.model.contextentity import ContextEntity

hackathon = crate.add(ContextEntity(crate, "#bh2021", properties={
    "@type": "Hackathon",
    "name": "Biohackathon 2021",
    "location": "Barcelona, Spain",
    "startDate": "2021-11-08",
    "endDate": "2021-11-12"
}))

Note that entities can have multiple types, e.g.:

    "@type": ["File", "SoftwareSourceCode"]

Consuming an RO-Crate

An existing RO-Crate package can be loaded from a directory or zip file:

crate = ROCrate('exp_crate')  # or ROCrate('exp_crate.zip')
for e in crate.get_entities():
    print(e.id, e.type)
./ Dataset
ro-crate-metadata.json CreativeWork
paper.pdf File
results.csv File
images/figure.svg File
https://orcid.org/0000-0000-0000-0000 Person
https://orcid.org/0000-0000-0000-0001 Person

The first two entities shown in the output are the root data entity and the metadata file descriptor, respectively. The former represents the whole crate, while the latter represents the metadata file. These are special entities managed by the ROCrate object, and are always present. The other entities are the ones we added in the section on RO-Crate creation.

As shown above, get_entities lets you iterate over all entities in the crate. You can also access only data entities with crate.data_entities and only contextual entities with crate.contextual_entities. For instance:

for e in crate.data_entities:
    author = e.get("author")
    if not author:
        continue
    elif isinstance(author, list):
        print(e.id, [p["name"] for p in author])
    else:
        print(e.id, repr(author["name"]))
paper.pdf ['Alice Doe', 'Bob Doe']
results.csv 'Alice Doe'
images/figure.svg 'Bob Doe'

You can fetch an entity by its @id as follows:

article = crate.dereference("paper.pdf")

Advanced features

Modifying the crate from JSON-LD dictionaries

The add_jsonld method adds a contextual entity directly from a JSON-LD dictionary containing at least the @id and @type keys:

crate.add_jsonld({
    "@id": "https://orcid.org/0000-0000-0000-0000",
    "@type": "Person",
    "name": "Alice Doe"
})

Existing entities can be updated from JSON-LD dictionaries via update_jsonld:

crate.update_jsonld({
    "@id": "https://orcid.org/0000-0000-0000-0000",
    "name": "Alice K. Doe"
})

There is also an add_or_update_jsonld method that adds the entity if it's not already in the crate and updates it if it already exists (note that, when updating, the @type key is ignored). This makes it possible to "patch" an RO-Crate from a JSON-LD file. For instance, suppose you have the following patch.json file:

{
    "@graph": [
        {
            "@id": "./",
            "author": {"@id": "https://orcid.org/0000-0000-0000-0001"}
        },
        {
            "@id": "https://orcid.org/0000-0000-0000-0001",
            "@type": "Person",
            "name": "Bob Doe"
        }
    ]
}

Then the following sets Bob as the author of the crate according to the above file:

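import json
from rocrate.rocrate import ROCrate

crate = ROCrate("temp-crate")
with open("patch.json") as f:
    json_data = json.load(f)
# Apply each item of the patch: add it if new, update it otherwise
for d in json_data.get("@graph", []):
    crate.add_or_update_jsonld(d)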
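
The add-or-update semantics can be sketched on plain dictionaries. This is a toy model under stated assumptions, not the library's implementation: the add_or_update function and the index dictionary (keyed by @id) are hypothetical, and the only behavior taken from the text above is that @type is ignored on update.

```python
def add_or_update(index, jsonld):
    """Add jsonld to index (keyed by @id), or merge it into an existing entry.

    Toy model only: on update the @type key is skipped, mirroring the note above.
    """
    entity_id = jsonld["@id"]
    if entity_id not in index:
        index[entity_id] = dict(jsonld)
    else:
        for key, value in jsonld.items():
            if key not in ("@id", "@type"):
                index[entity_id][key] = value
    return index[entity_id]


index = {"./": {"@id": "./", "@type": "Dataset"}}
patch = [
    {"@id": "./", "@type": "Thing", "author": {"@id": "#bob"}},
    {"@id": "#bob", "@type": "Person", "name": "Bob Doe"},
]
for item in patch:
    add_or_update(index, item)
```

After the loop, the root entry keeps its original type and gains an author, while the new person entry is added as is.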

Command Line Interface

ro-crate-py includes a hierarchical command line interface: the rocrate tool. rocrate is the top-level command, while specific functionalities are provided via sub-commands. Currently, the tool can initialize a directory tree as an RO-Crate (rocrate init) and modify the metadata of an existing RO-Crate (rocrate add).

$ rocrate --help
Usage: rocrate [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  add
  init
  write-zip

Crate initialization

The rocrate init command explores a directory tree and generates an RO-Crate metadata file (ro-crate-metadata.json) listing all files and directories as File and Dataset entities, respectively.

$ rocrate init --help
Usage: rocrate init [OPTIONS]

Options:
  --gen-preview         Generate a HTML preview file for the crate.
  -e, --exclude NAME    Exclude files or directories from the metadata file.
                        NAME may be a single name or a comma-separated list of
                        names.
  -c, --crate-dir PATH  The path to the root data entity of the crate.
                        Defaults to the current working directory.
  --help                Show this message and exit.

The command acts on the current directory, unless the -c option is specified. The metadata file is added (overwritten if present) to the directory at the top level, turning it into an RO-Crate.

Adding items to the crate

The rocrate add command adds files, datasets (directories), workflows and other entity types (currently testing-related metadata) to an RO-Crate:

$ rocrate add --help
Usage: rocrate add [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  dataset
  file
  test-definition
  test-instance
  test-suite
  workflow

Note that data entities (e.g., workflows) must already be present in the directory tree: the effect of the command is to register them in the metadata file.

Example

# From the ro-crate-py repository root
cd test/test-data/ro-crate-galaxy-sortchangecase

This directory is already an RO-Crate. Delete the metadata file to get a plain directory tree:

rm ro-crate-metadata.json

Now the directory tree contains several files and directories, including a Galaxy workflow and a Planemo test file, but it's not an RO-Crate since there is no metadata file. Initialize the crate:

rocrate init

This creates an ro-crate-metadata.json file that lists files and directories rooted at the current directory. Note that the Galaxy workflow is listed as a plain File:

{
  "@id": "sort-and-change-case.ga",
  "@type": "File"
}

To register the workflow as a ComputationalWorkflow:

rocrate add workflow -l galaxy sort-and-change-case.ga

Now the workflow has a type of ["File", "SoftwareSourceCode", "ComputationalWorkflow"] and points to a ComputerLanguage entity that represents the Galaxy workflow language. Also, the workflow is listed as the crate's mainEntity (this is required by the Workflow RO-Crate profile, a subtype of RO-Crate which provides extra specifications for workflow metadata).

To add workflow testing metadata to the crate:

rocrate add test-suite -i test1
rocrate add test-instance test1 http://example.com -r jobs -i test1_1
rocrate add test-definition test1 test/test1/sort-and-change-case-test.yml -e planemo -v '>=0.70'

To add files or directories after crate initialization:

cp ../sample_file.txt .
rocrate add file sample_file.txt -P name=sample -P description="Sample file"
cp -r ../test_add_dir .
rocrate add dataset test_add_dir

The above example also shows how to set arbitrary properties for the entity with -P. This is supported by most rocrate add subcommands.

$ rocrate add workflow --help
Usage: rocrate add workflow [OPTIONS] PATH

Options:
  -l, --language [cwl|galaxy|knime|nextflow|snakemake|compss|autosubmit]
                                  The workflow language.
  -c, --crate-dir PATH            The path to the root data entity of the
                                  crate. Defaults to the current working
                                  directory.
  -P, --property KEY=VALUE        Add an additional property to the metadata
                                  for this entity. Can be used multiple times
                                  to set multiple properties.
  --help                          Show this message and exit.

License

  • Copyright 2019-2024 The University of Manchester, UK
  • Copyright 2020-2024 Vlaams Instituut voor Biotechnologie (VIB), BE
  • Copyright 2020-2024 Barcelona Supercomputing Center (BSC), ES
  • Copyright 2020-2024 Center for Advanced Studies, Research and Development in Sardinia (CRS4), IT
  • Copyright 2022-2024 École Polytechnique Fédérale de Lausanne, CH
  • Copyright 2024 Data Centre, SciLifeLab, SE

Licensed under the Apache License, version 2.0 https://www.apache.org/licenses/LICENSE-2.0, see the file LICENSE.txt for details.

Cite as

DOI

The above DOI corresponds to the latest versioned release as published to Zenodo, where you will find all earlier releases. To cite ro-crate-py independent of version, use https://doi.org/10.5281/zenodo.3956493, which will always redirect to the latest release.

You may also be interested in the paper Packaging research artefacts with RO-Crate.

ro-crate-py's Issues

Can't read recent WorkflowHub crates

Reported by @lrodrin and @jmfernandez at yesterday's community meeting. One such example is https://workflowhub.eu/workflows/244.

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate("/tmp/workflow-244-3")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/simleo/git/ro-crate-py/rocrate/rocrate.py", line 90, in __init__
    source = self.__read(source, gen_preview=gen_preview)
  File "/home/simleo/git/ro-crate-py/rocrate/rocrate.py", line 132, in __read
    self.__read_data_entities(entities, source, gen_preview)
  File "/home/simleo/git/ro-crate-py/rocrate/rocrate.py", line 177, in __read_data_entities
    metadata_id, root_id = self.find_root_entity_id(entities)
  File "/home/simleo/git/ro-crate-py/rocrate/rocrate.py", line 150, in find_root_entity_id
    if conformsTo and conformsTo.startswith("https://w3id.org/ro/crate/"):
AttributeError: 'list' object has no attribute 'startswith'
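
One way to avoid this kind of failure is to normalize conformsTo to a list before inspecting it. The following is a sketch, not the fix adopted by the library: as_list and is_metadata_descriptor are hypothetical helpers, and the second conformsTo URI in the sample data is just an example of an additional profile.

```python
RO_CRATE_PREFIX = "https://w3id.org/ro/crate/"


def as_list(value):
    """Wrap a single JSON-LD value in a list; pass lists through; None becomes []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def is_metadata_descriptor(entity):
    """True if any conformsTo reference of the entity points to an RO-Crate spec URI."""
    for ref in as_list(entity.get("conformsTo")):
        uri = ref.get("@id", "") if isinstance(ref, dict) else ref
        if isinstance(uri, str) and uri.startswith(RO_CRATE_PREFIX):
            return True
    return False


single = {"@id": "ro-crate-metadata.json",
          "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"}}
multiple = {"@id": "ro-crate-metadata.json",
            "conformsTo": [
                {"@id": "https://w3id.org/ro/crate/1.1"},
                {"@id": "https://w3id.org/workflowhub/workflow-ro-crate/1.0"},
            ]}
other = {"@id": "paper.pdf"}
```

With this approach a single reference and a list of references are handled by the same code path.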

Bug? Fails to load a crate file without its data entities (directories)

I think this is a bug.

On loading a directory with ONLY the sample file from here: https://raw.githubusercontent.com/ResearchObject/example-ro-sample-image-crate/main/sample-crate/ro-crate-metadata.jsonld

from rocrate.rocrate import ROCrate

crate = ROCrate("sample") 

It dies like this:

Traceback (most recent call last):
  File "/Users/pt/working/ro-crate-py/test.py", line 3, in <module>
    crate = ROCrate("sample") 
  File "/Users/pt/rocrate/lib/python3.9/site-packages/rocrate/rocrate.py", line 92, in __init__
    self.build_crate(entities, source_path, gen_preview)
  File "/Users/pt/rocrate/lib/python3.9/site-packages/rocrate/rocrate.py", line 226, in build_crate
    raise Exception('Directory not found')
Exception: Directory not found

Expected behaviour:

I would expect to be able to load an incomplete crate if I am:

  • Writing an HTML renderer
  • Writing a validator and I want to report all the errors (rather than dying on the first one)

Add optional minimal validation

Though RO-Crate's philosophy is to impose few strict requirements and to rely on examples and best practices for details, a certain degree of validation can be performed. We could add an option to do that. The default should be no validation though, both for backwards compatibility and to continue reaping the benefits of RO-Crate's flexibility.

We could start with an option for minimal validation (even explicitly checking that the metadata file is valid JSON would be something) and perhaps plug in CheckMyCrate at some point.

Handle duplicates in property values

from rocrate.rocrate import ROCrate
from rocrate.model.person import Person

crate = ROCrate()
john = crate.add(Person(crate, "#johndoe"))
jane = crate.add(Person(crate, "#janedoe"))
crate.root_dataset["author"] = [john, jane, john]
crate.root_dataset.properties()
{'@id': './',
 '@type': 'Dataset',
 'datePublished': '2022-07-20T10:25:39+00:00',
 'author': [{'@id': '#johndoe'}, {'@id': '#janedoe'}, {'@id': '#johndoe'}]}

I.e., the JSON-LD is not properly flattened. Note that, while in the above example the API user can easily avoid generating the duplicate, in the general case it may be much trickier to even notice that one is being generated (e.g., subsequent calls to Entity.append_to in different sections of the code).

This should be dealt with in "real time", so that the crate stays flattened at all times and assertions like len(crate.root_dataset["author"]) == 2 don't fail while one is still working on it. Since lookup by value in a list is O(n), extending a property with subsequent calls to append_to would become quadratic. We should therefore switch to sets for property values, which is also closer to their actual semantics, since they have no predefined order. Should we then add support for JSON-LD lists? Are they supported / do they make sense in Schema.org / RO-Crate?
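
As a sketch of the idea, a dict keyed by @id can act as an ordered set: membership checks are O(1) on average and insertion order is kept. The append_unique helper below is hypothetical and only illustrates the data structure, not a proposed API.

```python
def append_unique(refs, new_ref):
    """Append a {"@id": ...} reference unless one with the same @id is present.

    refs is a dict used as an ordered set: keys are @id strings, so the
    membership check is O(1) on average and insertion order is preserved.
    """
    refs.setdefault(new_ref["@id"], new_ref)
    return refs


authors = {}
for person_id in ["#johndoe", "#janedoe", "#johndoe"]:
    append_unique(authors, {"@id": person_id})

flattened = list(authors.values())
```

The repeated reference to #johndoe is dropped, so flattened holds two entries.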

Support indirect data entity linking from root

Currently we detect data entities only if they are directly linked to from the root data entity:

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "spam"},
        {"@id": "spam/foo.txt"}
    ]
},
{
    "@id": "spam",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "spam/foo.txt"}
    ]
},
{
    "@id": "spam/foo.txt",
    "@type": "File"
}

We should also support indirect linking as specified in the specs. In the above example, for instance, we should treat spam/foo.txt as a data entity even if it wasn't listed in the root data entity's hasPart.
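
Indirect linking amounts to following hasPart references transitively from the root. A minimal sketch on plain JSON-LD dictionaries follows; collect_data_entity_ids is a hypothetical helper, and the graph is the example above with spam/foo.txt removed from the root's hasPart.

```python
def collect_data_entity_ids(graph, root_id="./"):
    """Return the @ids reachable from the root through hasPart, in visit order."""
    by_id = {entity["@id"]: entity for entity in graph}
    found, stack, seen = [], [root_id], {root_id}
    while stack:
        current = by_id.get(stack.pop(), {})
        parts = current.get("hasPart", [])
        if isinstance(parts, dict):  # a single reference is also valid JSON-LD
            parts = [parts]
        for ref in parts:
            part_id = ref["@id"]
            if part_id not in seen:
                seen.add(part_id)
                found.append(part_id)
                stack.append(part_id)
    return found


graph = [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "spam"}]},
    {"@id": "spam", "@type": "Dataset", "hasPart": [{"@id": "spam/foo.txt"}]},
    {"@id": "spam/foo.txt", "@type": "File"},
]
data_ids = collect_data_entity_ids(graph)
```

The seen set guards against cycles and against entities that are linked from more than one dataset.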

Delete entity support

Right now ro-crate-py only offers a way to put data in.
Scientists will also want to use this tool to manage their data,
so a delete function would be useful.
Another interesting enhancement would be the ability to replace certain files.
Say you have a spreadsheet with a few attributes linked to it, and you want to replace it with an updated spreadsheet.
It would be handy to be able to just replace the old file in the RO-Crate with the new one, keeping all the attribute metadata of the old file to save time.

Adding a directory to a new crate gives an error when trying to write the crate: AttributeError: 'str' object has no attribute 'exists'

See the example below:

import os
from rocrate.rocrate import ROCrate
crate = ROCrate()
os.makedirs("tmp3", exist_ok=True)
dataset_entity = crate.add_directory(source="tmp3", dest_path="new_tmp")
crate.write("./new_crate4")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_9363/134362041.py in <module>
      4 open('tmp3/empty_file', 'w').close()
      5 dataset_entity = crate.add_directory(source="tmp", dest_path="new_tmp")
----> 6 crate.write("./new_crate3")

~/Downloads/Elixir/ro-crate-py/rocrate/rocrate.py in write(self, base_path)
    473             self._copy_unlisted(self.source, base_path)
    474         for writable_entity in self.data_entities + self.default_entities:
--> 475             writable_entity.write(base_path)
    476 
    477     write_crate = write  # backwards compatibility

~/Downloads/Elixir/ro-crate-py/rocrate/model/dataset.py in write(self, base_path)
     49         else:
     50             out_path.mkdir(parents=True, exist_ok=True)
---> 51             if not self.crate.source and self.source and self.source.exists():
     52                 self.crate._copy_unlisted(self.source, out_path)
     53 

AttributeError: 'str' object has no attribute 'exists'
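
The traceback suggests that a plain string reached code that expects a pathlib.Path. One possible remedy, shown here only as a sketch, is to normalize the source when it is received; normalize_source is a hypothetical helper, not a function of the library.

```python
from pathlib import Path


def normalize_source(source):
    """Return a Path for local sources given as str or Path; leave None alone."""
    if source is None:
        return None
    return Path(source)


source = normalize_source("tmp3")
exists = source.exists()  # works for str input too, since it is now a Path
```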

Entity id should not be modifiable

It's used to index the entity in the crate's __entity_map, so changing it leads to inconsistencies:

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate()
>>> d = crate.add_dataset("FOO")
>>> crate._ROCrate__entity_map
{..., 'arcp://uuid,2f145cc1-20be-4cd7-ac86-d6d4a08cdcf9/FOO': <FOO/ Dataset>}
>>> d.id = "foo"
>>> crate._ROCrate__entity_map
{..., 'arcp://uuid,2f145cc1-20be-4cd7-ac86-d6d4a08cdcf9/FOO': <foo Dataset>}
>>> crate.dereference("foo")
>>> crate.dereference("FOO")
<foo Dataset>

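A common way to prevent this in Python is to expose the id through a property without a setter. The class below is a hypothetical illustration of that technique, not the library's Entity class:

```python
class ReadOnlyIdEntity:
    """Hypothetical entity whose id is fixed at construction time."""

    def __init__(self, identifier):
        self._id = identifier

    @property
    def id(self):
        # No setter is defined, so assignment raises AttributeError.
        return self._id


entity = ReadOnlyIdEntity("FOO/")
try:
    entity.id = "foo"
    assignment_blocked = False
except AttributeError:
    assignment_blocked = True
```
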
Entity getitem breaks for null JSON values

Example:

    {
      "@id": "sort-and-change-case.ga",
      "@type": [
        "File",
        "SoftwareSourceCode",
        "ComputationalWorkflow"
      ],
      "programmingLanguage": {
        "@id": "#galaxy"
      },
      "name": null
    }
>>> crate.mainEntity["name"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/simleo/git/ro-crate-py/rocrate/model/entity.py", line 81, in __getitem__
    deref_values = [self.crate.dereference(_["@id"], _["@id"]) for _ in values]
  File "/home/simleo/git/ro-crate-py/rocrate/model/entity.py", line 81, in <listcomp>
    deref_values = [self.crate.dereference(_["@id"], _["@id"]) for _ in values]
TypeError: 'NoneType' object is not subscriptable
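
A defensive version of the dereferencing step would only treat dictionaries with an @id key as references and pass everything else through. The sketch below works on plain dictionaries; resolve_values and lookup are hypothetical names, not the library's API.

```python
def resolve_values(raw, lookup):
    """Turn a raw JSON-LD property value into Python values.

    References of the form {"@id": ...} are looked up in `lookup` (a dict keyed
    by @id); anything else, including None, is returned unchanged.
    """
    if raw is None:
        return None
    values = raw if isinstance(raw, list) else [raw]
    resolved = [
        lookup.get(v["@id"], v["@id"]) if isinstance(v, dict) and "@id" in v else v
        for v in values
    ]
    return resolved if isinstance(raw, list) else resolved[0]


lookup = {"#galaxy": "<#galaxy ComputerLanguage>"}
workflow = {"programmingLanguage": {"@id": "#galaxy"}, "name": None}
name = resolve_values(workflow["name"], lookup)
language = resolve_values(workflow["programmingLanguage"], lookup)
```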

Windows compliance broke

Traceback (most recent call last):
  File ".\ro_crate_example_readme.py", line 11, in <module>
    wf_crate.write_zip(out_path)
  File "C:\Users\bertd\AppData\Local\Programs\Python\Python38\lib\site-packages\rocrate\rocrate.py", line 367, in write_zip
    writable_entity.write_zip(zf)
  File "C:\Users\bertd\AppData\Local\Programs\Python\Python38\lib\site-packages\rocrate\model\file.py", line 127, in write_zip
    zip_out.write(self.source, self.id)
  File "C:\Users\bertd\AppData\Local\Programs\Python\Python38\lib\zipfile.py", line 1739, in write
    zinfo = ZipInfo.from_file(filename, arcname,
  File "C:\Users\bertd\AppData\Local\Programs\Python\Python38\lib\zipfile.py", line 521, in from_file
    st = os.stat(filename)

write_zip should not include the new archive in itself

If one calls write_zip with an output path that is within the RO-crate directory, the new zip file ends up inside itself. As write_zip walks the directory tree it'll find the partially written zip file and copy it into the archive.
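
A straightforward guard is to compare each file found during the walk with the output path and skip it on a match. The following standard-library sketch shows the idea; zip_tree is a hypothetical function, not the library's write_zip.

```python
import os
import tempfile
import zipfile
from pathlib import Path


def zip_tree(src_dir, out_path):
    """Zip src_dir into out_path, skipping out_path itself if it lies inside src_dir."""
    src_dir = Path(src_dir).resolve()
    out_path = Path(out_path).resolve()
    with zipfile.ZipFile(out_path, "w") as zf:
        for dirpath, _, filenames in os.walk(src_dir):
            for name in filenames:
                file_path = Path(dirpath) / name
                if file_path.resolve() == out_path:
                    continue  # do not archive the archive being written
                zf.write(file_path, file_path.relative_to(src_dir).as_posix())
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data.txt").write_text("hello")
    archive = zip_tree(tmp, Path(tmp) / "crate.zip")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
```

Even though crate.zip is created inside the directory being archived, it does not appear among the archive's members.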

Improve support for context extensions

#53 added initial support for context extensions, but this is currently limited to updating the context with testing namespace terms when a testing-related entity is added to a crate via specific methods such as add_test_suite. The main thing that remains to be covered is preserving any context extensions when a crate is read, so they don't get lost in the read-write round trip. Also, when adding an entity to a crate, we should probably add only the relevant extension to the context, rather than dumping the whole shebang like we're now doing for testing-related entities.

See ResearchObject/ro-crate-ruby#17.

Entity get/set item can make the process hang indefinitely

Entity's __getitem__ and __setitem__ can lead to weird behavior.

from rocrate.rocrate import ROCrate
from rocrate.model.person import Person

crate = ROCrate("test/test-data/ro-crate-galaxy-sortchangecase")
print(crate.root_dataset["mentions"])
joe = Person(crate, "joe")
print(crate.root_dataset["mentions"] + [joe])
crate.root_dataset["mentions"] += [joe]
print(crate.root_dataset["mentions"])

This prints:

[<#test1 TestSuite>]
[<#test1 TestSuite>, <#joe Person>]
[<#test1 TestSuite>, '#joe']

which is not what one would expect from += (crate.root_dataset["mentions"] should be equal to [<#test1 TestSuite>, <#joe Person>]). Even worse, the following makes the program hang indefinitely:

crate.root_dataset["mentions"] += joe

write_zip fails for read_crate

from rocrate.rocrate import ROCrate

crate = ROCrate("test/test-data/read_crate")
crate.write_zip("/tmp/crate.zip")
FileNotFoundError: [Errno 2] No such file or directory: 'https://raw.githubusercontent.com/ResearchObject/ro-crate-py/master/test/test-data/sample_file.txt'

ro-crate profile functionality

Hi, I'm looking for a way to add a custom profile to an ro-crate. Something like crate.conforms_to("<my_profile_uri>"). Is something like that under development somewhere? Otherwise I could raise a PR with my local edits.

Thanks!

CLI: make specifying IDs more user-friendly

This is not super-friendly, and exposes the detail that IDs different from absolute URLs need to start with #:

rocrate add test-suite -i \#test1

We should probably allow the following:

rocrate add test-suite -i test1  # creates a "#test1" identity
rocrate add test-suite -i http://example.com/test2

It should be possible to do that by checking if the string being passed is a full URL, and prepending a # if it's not. This would have to be entity-specific though: a test definition, for instance, is a data entity, so we never have to prepend # to its ID.
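
The check could look roughly like the sketch below, which treats a string as a full URL when it has both a scheme and a network location. normalize_contextual_id is a hypothetical helper; absolute URIs without a network location (e.g., urn: identifiers) would need extra handling that is not covered here.

```python
from urllib.parse import urlsplit


def normalize_contextual_id(identifier):
    """Prepend '#' unless the identifier is an absolute URL or already starts with '#'."""
    parts = urlsplit(identifier)
    is_absolute_url = bool(parts.scheme and parts.netloc)
    if is_absolute_url or identifier.startswith("#"):
        return identifier
    return "#" + identifier


plain = normalize_contextual_id("test1")
url = normalize_contextual_id("http://example.com/test2")
already_local = normalize_contextual_id("#test1")
```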

Exclude option for crate init

It would be nice to support an exclude option to avoid considering certain sub-paths when initializing an RO-Crate from a directory tree, i.e., ROCrate(source, init=True). For instance, from the command line, you might want to run:

rocrate init --exclude .git
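
On the Python side, exclusion can be implemented by pruning directories while walking the tree. The sketch below uses os.walk and is only an illustration: walk_with_exclude is a hypothetical function, and matching on paths relative to the top directory is an assumption about how the option might behave.

```python
import os
import tempfile
from pathlib import Path


def walk_with_exclude(top, exclude=()):
    """Yield relative POSIX paths of files under top, skipping excluded sub-paths."""
    top = Path(top)
    excluded = {Path(e).as_posix() for e in exclude}
    for dirpath, dirnames, filenames in os.walk(top):
        rel_dir = Path(dirpath).relative_to(top)
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if (rel_dir / d).as_posix() not in excluded]
        for name in filenames:
            rel = (rel_dir / name).as_posix()
            if rel not in excluded:
                yield rel


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".git").mkdir()
    (root / ".git" / "config").touch()
    (root / "data.csv").touch()
    kept = sorted(walk_with_exclude(root, exclude=[".git"]))
```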

Ensure Entity.properties yields ready-to-write JSON

This line in the root dataset code auto-converts datePublished into a Python datetime object:

default_properties = {'datePublished': datetime.datetime.now()}

Things are fine when writing an RO-Crate, since this is handled in the write method of the Metadata class. However, users might want to generate the JSON metadata for the crate without necessarily writing it out. We should return serialized JSON from the properties method.
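
One way to get there is to convert datetime values to ISO 8601 strings before returning the properties. The sketch below shows the conversion on a plain dictionary; to_json_ready is a hypothetical helper and the sample date is arbitrary.

```python
import datetime
import json


def to_json_ready(properties):
    """Return a copy of properties where datetime values are ISO 8601 strings."""
    def convert(value):
        if isinstance(value, (datetime.datetime, datetime.date)):
            return value.isoformat()
        if isinstance(value, list):
            return [convert(v) for v in value]
        if isinstance(value, dict):
            return {k: convert(v) for k, v in value.items()}
        return value
    return convert(properties)


props = {
    "@id": "./",
    "@type": "Dataset",
    "datePublished": datetime.datetime(2022, 7, 20, 10, 25, 39, tzinfo=datetime.timezone.utc),
}
ready = to_json_ready(props)
serialized = json.dumps(ready)  # json.dumps(props) would raise TypeError instead
```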

test_remote_uri_exceptions fails if run as root

The root user ignores the 444 mode on the subdir, so the expected exception is not raised:

E           Failed: DID NOT RAISE <class 'PermissionError'>

Better find a different use case for generating an error that's not a URL access problem.

Download Externally Defined Data

Hey, I have a small (but maybe annoying to implement) request. Let me know what you think. When I think of Research Objects, I think of big, bulky static objects. However, because users can reference external data, RO-Crates can be incredibly portable! It would be awesome to be able to turn these lightweight crates into their big, bulky, archive-ready form with the stroke of a key.

Use Case: A researcher downloads an RO from a repository and wants to resolve all of the externally defined data in ro-crate-metadata.jsonld.

I don't think that this is a particularly common use case. However, it's one of the primary reasons people use other formats such as BD-Bag (it gives them a fetch.txt and their software can resolve the entries). RO-Crate stores equivalent information and, with a little parsing, should be able to reach feature parity. I find that to be a really neat feature that's definitely worth advertising.

"file" is a reserved keyword

Hi guys. I've learned that using file as a module name (as in file.py) may have some bad consequences (e.g., shadowing), since it is a reserved keyword.

Track files/dirs not listed in the metadata file

From the spec:

Payload files may appear directly in the RO-Crate Root alongside the RO-Crate Metadata File, and/or appear in sub-directories of the RO-Crate Root. Each file and directory MAY be represented as Data Entities in the RO-Crate Metadata File.

It is important to note that the RO-Crate Metadata File is not an exhaustive manifest or inventory, that is, it does not necessarily list or describe all files in the package.

Currently, when reading an ro-crate from a directory, files not listed in the metadata file are ignored. Thus, when the crate is written out to some other location, those files are not kept.

Use gxformat2 to convert .ga to .cwl?

Came up at the 2020 Elixir biohackathon.

Experimented with this in https://github.com/ResearchObject/ro-crate-py/tree/gxformat2_cwl_conv. Here are the changes. I checked the output from converting test/test-data/test_galaxy_wf.ga and the one output by gxformat2 is very different from the one obtained with galaxy2cwl. I'm not even sure the latter is a valid CWL workflow. Did I use the gxformat2 API in the wrong way? If not, maybe this needs to be checked by a CWL expert.

Copy error when adding files to an existing rocrate

Problem: Trying to add a file to an existing RO-Crate results in a copy error that keeps the crate from being updated.

Use case: two scientists working together on a project and building the RO-Crate in several steps (re-instantiating the same crate multiple times).

Solution: an internal check that detects whether a file is already present in the crate folder and, if so, skips the copy.
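
A narrow version of the proposed check, sketched with the standard library, skips the copy only when source and destination are the same file. copy_unless_same is a hypothetical helper; whether a file that exists at the destination with different content should also be skipped is left open here.

```python
import shutil
import tempfile
from pathlib import Path


def copy_unless_same(src, dst):
    """Copy src to dst, doing nothing if both paths refer to the same file."""
    src, dst = Path(src), Path(dst)
    if dst.exists() and src.samefile(dst):
        return False  # already in place, nothing to copy
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, dst)
    return True


with tempfile.TemporaryDirectory() as tmp:
    crate_dir = Path(tmp)
    existing = crate_dir / "results.csv"
    existing.write_text("a,b\n1,2\n")
    copied_in_place = copy_unless_same(existing, crate_dir / "results.csv")
    copied_elsewhere = copy_unless_same(existing, crate_dir / "backup" / "results.csv")
```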

Duplicate file copies when writing a crate

from rocrate.rocrate import ROCrate

crate = ROCrate("test/test-data/read_crate")
crate.write_zip("/tmp/crate.zip")

Running the above program, with:

  • the library instrumented to show where file copies come from
  • the RO-Crate metadata file changed to also list examples/README.txt as a File
  • the output sorted

gives:

test/test-data/read_crate/abstract_wf.cwl File
test/test-data/read_crate/examples/README.txt Dataset
test/test-data/read_crate/examples/README.txt File
test/test-data/read_crate/ro-crate-metadata.jsonld ROCrate
test/test-data/read_crate/ro-crate-preview.html File
test/test-data/read_crate/test_file_galaxy.txt File
test/test-data/read_crate/test_galaxy_wf.ga File
test/test-data/read_crate/test/test-metadata.json Dataset
test/test-data/read_crate/test/test-metadata.json ROCrate

Bug: When ingesting a File entity its @id gets a # suffix

If I ingest the crate below and then inspect the @id, {"@id": "test.csv", "@type": "File"} turns into

{'@id': '#test.csv', '@type': 'File'}

I have tried this with the file in the directory and without - same result.

Using code like this:

from rocrate.rocrate import ROCrate

crate = ROCrate("./")
for e in crate.get_entities():
    print(e.as_jsonld())  # JSON entry

The metadata file of the crate in question:

{
  "@context": [
    "https://w3id.org/ro/crate/1.1/context",
    {
      "@vocab": "http://schema.org/"
    },
    {
      "@base": null
    }
  ],
  "@graph": [
    {
      "@id": "#collection",
      "@type": "RepositoryCollection ",
      "name": "Test collection"
    },
    {
      "@id": "./",
      "@type": "Dataset",
      "hasFile": [{"@id": "test.csv"}],
      "hasPart": [
        {
          "@id": "#collection"
        }
      ],
      "name": "testing hasPart"
    },
    {"@id": "test.csv", "@type": "File"},
    {
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "about": {
        "@id": "./"
      },
      "identifier": "ro-crate-metadata.json"
    }
  ]
}

Iterate through graph

Hey,
is there an easy way to iterate/walk through the graph in Python?
As far as I can see, in version 0.7 the top node is parsed and then, upon request, one can go to a different node. Is there a function that automatically iterates/walks through each node?
Thanks, Steffen
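
One library-independent way to walk every node is to read the metadata file directly, since an RO-Crate's `@graph` is a flat list. The sketch below assumes the crate is a local directory containing `ro-crate-metadata.json`; `iter_graph` is a hypothetical helper name, not part of ro-crate-py.

```python
import json
from pathlib import Path


def iter_graph(crate_dir):
    """Yield every entity (a plain dict) from the crate's flat @graph."""
    metadata_path = Path(crate_dir) / "ro-crate-metadata.json"
    with open(metadata_path) as f:
        metadata = json.load(f)
    # The graph is flat, so a single loop visits every node.
    yield from metadata["@graph"]
```

Looping with `for entity in iter_graph("."):` and printing `entity["@id"]` and `entity.get("@type")` lists every node in the crate.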

Fix behavior wrt "missing" data entities

After the merge of #75, we allow data entities whose @id does not map to an existing file or directory, even if local. While this adds flexibility for use cases like #73, such crates are not supported by other libraries. Specifically, I got reports of crates generated with ro-crate-py that lead to errors when submission to WorkflowHub is attempted. The main issue is that it's too easy to add a "missing" data entity:

crate.add_file("/this/does/not/exist")  # Adds a File entity with @id = "exist"

This means it can be done involuntarily, by simply mistyping the path. It should be harder to do so (advanced usage), and we should issue a warning when it's done.
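
A guard along these lines is one way to make the "missing" case explicit. It is only a sketch: `check_local_source` and its `allow_missing` flag are hypothetical names chosen for illustration, not the library's API.

```python
import warnings
from pathlib import Path


def check_local_source(source, allow_missing=False):
    """Return source as a Path, refusing or flagging paths that do not exist.

    Hypothetical helper: a missing path raises unless the caller opts in,
    and even then a warning is issued.
    """
    source = Path(source)
    if source.exists():
        return source
    if not allow_missing:
        raise FileNotFoundError(
            f"{source} does not exist; pass allow_missing=True to add it anyway"
        )
    warnings.warn(f"{source} does not exist; adding an entity with no local file")
    return source
```

With this shape, a mistyped path fails immediately, while the advanced use case stays available behind an explicit flag.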

Let us provide a pleasant inroad for new contributors

  • some Contributors.md - a readme targeted at people who want to contribute
    • listing rules of engagement and the way we work over here
  • possibly also adding a (more language-independent) Makefile

I suggest we use this issue to gather topical suggestions for those:

  • list the commands you all use regularly to build / test / generate docs / lint / ...
  • any other guidelines people should follow before submitting pull requests
  • ...

and then use that to actually build those docs and, where possible, automate things with a Makefile and/or scripts

Don't validate URLs by default

File has a validate_url kwarg (True by default) that makes it perform a request for source in order to retrieve its content size and encoding format. This is, of course, very expensive, and can make the process of reading a crate with a lot of URLs very slow.

We should:

  • change the default value to False
  • move the check to the write phase, since:
    • it only matters when updating the crate
    • if fetch_remote is also True we can avoid making two requests for the same URL
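
The proposal can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the library's File implementation: `LazyRemoteFile` and its `opener` parameter are hypothetical, and only the `validate_url` and `fetch_remote` names come from the text above.

```python
import urllib.request


class LazyRemoteFile:
    """Hypothetical stand-in for a remote File entity (not ro-crate-py's class)."""

    def __init__(self, url, validate_url=False, fetch_remote=False, opener=None):
        # Constructing the entity performs no network request.
        self.url = url
        self.validate_url = validate_url
        self.fetch_remote = fetch_remote
        self._opener = opener or urllib.request.urlopen
        self.properties = {}
        self.content = None

    def write(self):
        """Contact the URL at most once, and only if a flag asks for it."""
        if not (self.validate_url or self.fetch_remote):
            return
        response = self._opener(self.url)
        try:
            if self.validate_url:
                headers = response.info()
                self.properties["contentSize"] = headers.get("Content-Length")
                self.properties["encodingFormat"] = headers.get("Content-Type")
            if self.fetch_remote:
                # Same response object: no second request for the same URL.
                self.content = response.read()
        finally:
            response.close()
```

Reading a crate with many URLs then costs nothing up front, and the write phase issues one request per URL even when both flags are set.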

CLI subcommands to add files and directories

Add subcommands to allow this:

mkdir crate
cd crate
rocrate init  # generates minimal metadata file and nothing else
cp /some/other/path/file1 .
cp -rf /some/path/dir1 .
rocrate add file file1
rocrate add dataset dir1

Duplicate preview entry in output crate

import shutil
from rocrate.rocrate import ROCrate

shutil.rmtree("/tmp/out_crate", ignore_errors=True)
crate = ROCrate("test/test-data/read_crate")
crate.write("/tmp/out_crate")

The output crate has:

        {
            "@id": "ro-crate-preview.html",
            "@type": "CreativeWork",
            "about": {
                "@id": "./"
            }
        },

but also:

        {
            "@id": "#ro-crate-preview.html",
            "@type": "CreativeWork",
            "about": {
                "@id": "./"
            }
        },

CLI can generate an error when command help is requested

In a non-ro-crate dir:

$ rocrate add workflow --help
Traceback (most recent call last):
  File "/home/simleo/git/ro-crate-py/venv/bin/rocrate", line 11, in <module>
    load_entry_point('rocrate==0.5.0', 'console_scripts', 'rocrate')()
  File "/home/simleo/git/ro-crate-py/venv/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/simleo/git/ro-crate-py/venv/lib/python3.6/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/simleo/git/ro-crate-py/venv/lib/python3.6/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/simleo/git/ro-crate-py/venv/lib/python3.6/site-packages/click/core.py", line 1656, in invoke
    super().invoke(ctx)
  File "/home/simleo/git/ro-crate-py/venv/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/simleo/git/ro-crate-py/venv/lib/python3.6/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/simleo/git/ro-crate-py/venv/lib/python3.6/site-packages/click/decorators.py", line 38, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/home/simleo/git/ro-crate-py/venv/lib/python3.6/site-packages/rocrate/cli.py", line 58, in add
    state.crate = ROCrate(state.crate_dir, init=False, gen_preview=False)
  File "/home/simleo/git/ro-crate-py/venv/lib/python3.6/site-packages/rocrate/rocrate.py", line 90, in __init__
    source = self.__read(source, gen_preview=gen_preview)
  File "/home/simleo/git/ro-crate-py/venv/lib/python3.6/site-packages/rocrate/rocrate.py", line 129, in __read
    raise ValueError(f"Not a valid RO-Crate: missing {Metadata.BASENAME}")
ValueError: Not a valid RO-Crate: missing ro-crate-metadata.json

This should not happen, since the user is not actually trying to do anything ro-crate specific.

Track files/dirs not listed in the metadata file

From the spec:

Payload files may appear directly in the RO-Crate Root alongside the RO-Crate Metadata File, and/or appear in sub-directories of the RO-Crate Root. Each file and directory MAY be represented as Data Entities in the RO-Crate Metadata File.

It is important to note that the RO-Crate Metadata File is not an exhaustive manifest or inventory, that is, it does not necessarily list or describe all files in the package.

Currently, when reading an ro-crate, files or directories not listed in the metadata file are ignored. Thus, when the crate is written out to some other location, those files / directories are not kept.
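
To see which payload files would be dropped, one can compare the directory contents against the `@id` values in the metadata file. The sketch below is library-independent and makes simplifying assumptions: the crate is a local directory, `@id` values are plain relative paths (no URL encoding), and `find_untracked_files` is a hypothetical helper name.

```python
import json
from pathlib import Path


def find_untracked_files(crate_dir):
    """Return relative paths of files on disk that the metadata does not list."""
    crate_dir = Path(crate_dir)
    with open(crate_dir / "ro-crate-metadata.json") as f:
        graph = json.load(f)["@graph"]
    listed = {entity["@id"] for entity in graph if "@id" in entity}
    untracked = []
    for path in sorted(crate_dir.rglob("*")):
        if path.is_file():
            rel = path.relative_to(crate_dir).as_posix()
            if rel not in listed:
                untracked.append(rel)
    return untracked
```

Files inside a listed directory are still reported unless they are listed themselves, which matches the situation described above where only entities named in the metadata are kept.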

Add method to get entities by type

E.g., all entities with type "File". One might want to get only entities whose type is exactly "File", or also get those that include "File" in their type list (the latter seems the most useful). Two different methods? Same method with an argument for that?
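
The single-method option could look something like this. The sketch works on plain JSON-LD dicts rather than on the library's entity objects, and `get_by_type` with its `exact` argument is a hypothetical signature used only to illustrate the two matching modes.

```python
def get_by_type(entities, type_, exact=False):
    """Select entities by @type.

    exact=False: the entity's type list includes type_ (e.g. File + SoftwareSourceCode).
    exact=True:  the entity's type is type_ and nothing else.
    """
    selected = []
    for entity in entities:
        types = entity.get("@type", [])
        if isinstance(types, str):
            # JSON-LD allows a single type to be given as a bare string.
            types = [types]
        if (types == [type_]) if exact else (type_ in types):
            selected.append(entity)
    return selected
```

Defaulting to the inclusive match reflects the remark above that it seems the more useful of the two.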

Bug: Won't load a crate with hasPart reference to a RepositoryCollection.

Loading the below test data gives an error:

  File "/Users/pt/working/ro-crate-py/test.py", line 3, in <module>
    crate = ROCrate("haspart") 
  File "/Users/pt/rocrate/lib/python3.9/site-packages/rocrate/rocrate.py", line 92, in __init__
    self.build_crate(entities, source_path, gen_preview)
  File "/Users/pt/rocrate/lib/python3.9/site-packages/rocrate/rocrate.py", line 227, in build_crate
    self.add(instance)
UnboundLocalError: local variable 'instance' referenced before assignment

I would expect to be able to load anything with a flat JSON-LD structure even if it's not schema.org compliant.

Here's the offending ro-crate-metadata.json

{
  "@context": [
    "https://w3id.org/ro/crate/1.1/context",
    {
      "@vocab": "http://schema.org/"
    },
    {
      "@base": null
    }
  ],
  "@graph": [
    {
      "@id": "#collection",
      "@type": "RepositoryCollection ",
      "name": "Test collection"
    },
    {
      "@id": "./",
      "@type": "Dataset",
      "hasFile": [],
      "hasPart": [{"@id": "#collection"}],
      "name": "testing hasPart"
    },
    {
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "about": {
        "@id": "./"
      },
      "identifier": "ro-crate-metadata.json"
    }
  ]
}

How do I ...

Some documentation I could not find.

How to:

  • Iterate over all the items in a crate's graph? It's crude, but in the JavaScript library it's for (let item of crate.getGraph()) { ... }
  • Add a custom item that is not a Person etc. that I have constructed myself in JSON? (In the JavaScript library we say crate.addItem({"@id": "#someid", "@type": ["CorpusItem", "RepositoryObject"], "name": ....}))
  • Fetch an item by @id? In JS: crate.getItem("#someid")
  • Example task: Add an array of references to RepositoryObjects to the root dataset using a hasMember property and add the Repository Objects to the graph
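
For the example task, the operations are simple enough to sketch on the raw `@graph` list of the metadata document. This deliberately does not use ro-crate-py; `get_item`, `add_item` and `add_members` are hypothetical helpers that mirror the JavaScript calls mentioned above.

```python
def get_item(graph, item_id):
    """Return the entity with the given @id, or None if it is not in the graph."""
    for item in graph:
        if item.get("@id") == item_id:
            return item
    return None


def add_item(graph, item):
    """Append a hand-built JSON-LD entity, refusing duplicate @id values."""
    if get_item(graph, item["@id"]) is not None:
        raise ValueError(f"an item with @id {item['@id']!r} already exists")
    graph.append(item)
    return item


def add_members(graph, root_id, members):
    """Add each member to the graph and reference it from the root's hasMember."""
    root = get_item(graph, root_id)
    if root is None:
        raise KeyError(root_id)
    refs = root.setdefault("hasMember", [])
    for member in members:
        add_item(graph, member)
        refs.append({"@id": member["@id"]})
```

The root dataset only stores `{"@id": ...}` references, while the full RepositoryObject descriptions live as separate entries in the flat graph.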

Support for non-slash root data entity

The 1.1 spec states that the root data entity SHOULD be the string ./, but in principle it could be an arbitrary URI. This is in contrast with RO-Crate 1.0, where the root data entity was always ./. We now have a find_root_entity_id method that looks for the root data entity without assuming it's ./, but other parts of the code still assume that. For instance, see the RootDataset class. We should add tests for non-slash root entities and update the code as needed.
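
Locating the root without assuming ./ amounts to following the metadata file descriptor's `about` reference. The sketch below illustrates that idea on a plain `@graph` list; it is not the library's find_root_entity_id implementation, `find_root_id` is a hypothetical name, and matching the descriptor by basename is a simplifying assumption.

```python
METADATA_BASENAMES = ("ro-crate-metadata.json", "ro-crate-metadata.jsonld")


def find_root_id(graph):
    """Return the @id of the root data entity by following the descriptor's 'about'."""
    for entity in graph:
        # Accept both a bare file name and a URI ending in that file name.
        basename = str(entity.get("@id", "")).rsplit("/", 1)[-1]
        if basename in METADATA_BASENAMES:
            about = entity.get("about")
            if isinstance(about, dict) and "@id" in about:
                return about["@id"]
    raise ValueError("no metadata file descriptor with an 'about' reference found")
```

A test for the non-slash case can then use a graph whose descriptor points at, say, an absolute URI and check that the same URI comes back.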

Allow to attach partials to a crate?

Hi,

For Autosubmit, since the workflow configuration doesn't contain the information needed for RO-Crate, I used the exact same approach from COMPSs and asked users to provide a YAML file with authors & license.

Then I create the objects and attach/add to the RO-Crate-py object.

The implementation in Autosubmit is similar, but not identical, to COMPSs. Other workflow managers with similar needs may craft yet another way of doing the same.

It would be nice if there was a way to load RO-Crate-py entities directly from a dictionary/YAML data. Something like

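import yaml

from rocrate.rocrate import ROCrate
from rocrate.model.person import Person

crate = ROCrate()

with open('') as f:
    yaml_content = yaml.safe_load(f)

for author in yaml_content['authors']:
    # load_from_dict is the proposed method; it does not exist yet
    crate.add(Person.load_from_dict(author))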

Not sure how to validate the format of the entities... maybe instead of YAML receive JSON-LD directly, or provide a tool/script to read SPARQL+SHACL, etc.?

Cheers
Bruno

Adding a file to a new crate gives an error when trying adding source as FTP URI: AttributeError: '_io.BufferedReader' object has no attribute 'getheader'

See the below example:

from rocrate.rocrate import ROCrate

crate = ROCrate()
input_uri = "ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140407_D00360_0017_BH947YADXX/Project_RM8398/Sample_U5c/U5c_CCGTCC_L001_R1_001.fastq.gz"
crate.add_file(source=input_uri, fetch_remote=False)
crate.write_zip("./test/crate.zip")
------------------------------------------------------
Traceback (most recent call last):
  File "/Users/laurarodrigueznavas/PycharmProjects/ro-crate-py/test/test_laura.py", line 6, in <module>
    crate.write_zip("./test/crate.zip")
  File "/Users/laurarodrigueznavas/PycharmProjects/ro-crate-py/rocrate/rocrate.py", line 486, in write_zip
    self.write(tmp_dir)
  File "/Users/laurarodrigueznavas/PycharmProjects/ro-crate-py/rocrate/rocrate.py", line 476, in write
    writable_entity.write(base_path)
  File "/Users/laurarodrigueznavas/PycharmProjects/ro-crate-py/rocrate/model/file.py", line 50, in write
    'contentSize': response.getheader('Content-Length'),
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/tempfile.py", line 469, in __getattr__
    a = getattr(file, name)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/tempfile.py", line 469, in __getattr__
    a = getattr(file, name)
AttributeError: '_io.BufferedReader' object has no attribute 'getheader'
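
The traceback shows `getheader` being called on a response that does not provide it: `urllib.request.urlopen` returns an `http.client.HTTPResponse` for HTTP URLs, but a generic `urllib.response.addinfourl` wrapper for FTP URLs. Both expose their headers through `info()`, so a scheme-agnostic lookup could be sketched as below. `get_header` is a hypothetical helper, not the fix adopted by the library.

```python
def get_header(response, name):
    """Return a header value from a urlopen() response, or None if absent.

    Uses info(), which is available on both HTTP and FTP responses,
    instead of the HTTP-only getheader().
    """
    return response.info().get(name)
```

For FTP responses the returned headers may lack fields that HTTP servers normally send, so callers should be prepared for `None`.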
