spec's Introduction

The Oxford Common File Layout

Status of this document

This version: 1.0

Latest stable version: 1.0

Editors

  • Andrew Hankinson, Bodleian Libraries, University of Oxford
  • Neil Jefferies, Bodleian Libraries, University of Oxford
  • Rosalyn Metz, Emory University
  • Julian Morley, Stanford University
  • Simeon Warner, Cornell University
  • Andrew Woods, DuraSpace

Latest editors draft

https://ocfl.io/

Communication Channels

Mailing Lists

  • [email protected]: Broader communications, announcements, meeting notifications
  • [email protected]: OCFL community discussions, announcements, support, software implementer discussions, spec interpretation, technical discussions

Slack Channel

There is an #ocfl Slack channel that you can join as part of the code4lib Slack channel.

GitHub

We use GitHub to track use cases and discussions around specific issues of the specifications. These discussions can be found at https://github.com/ocfl/Use-Cases/issues.

This document is licensed under a Creative Commons Attribution 4.0 License.

spec's People

Contributors

ahankinson, awoods, bcail, birkland, dbernstein, julianmorley, ksclarke, neilsjefferies, rosy1280, srerickson, thomas-wrobel, zimeon

spec's Issues

Digest vs checksum vs hash

In the current draft we talk about digests, checksums and hashes... I think it is unhelpful to use three terms when we mean the same thing. I propose that we always say "digest" for the output and that we never say "checksum" (unless we somewhere wanted to say "digest, sometimes called a checksum"). We might also use "hash" to talk about the algorithms and the calculation, but I would again stick to "digest" for the output.

start of Feedback on Draft

The reason this is so important is that containers either have a single entrypoint, or a programmatically predictable (internally modular) set of entrypoints (e.g., see https://sci-f.github.io).

spec/index.html

Lines 113 to 120 in 9a2fd7b

<p>
This Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage
of digital information in a structured, transparent, and predictable manner. It is designed to promote long-term
object management best practices within digital repositories.
</p>
<p>
This specification covers two principle areas:
</p>

Right now the Scientific Filesystem (scif above) has internal modularity for the software inside, and it has a spot for data, but not any specification to how it's organized. This would be a really strong use case for how the standard can work with containers.

Best Practices

We might want to be conservative about saying "best practices." The project is developing a standard, so it is (by definition) something like that, but many will hear "best practices" and immediately find a reason why it's not the absolute best.

spec/index.html

Line 116 in 9a2fd7b

object management best practices within digital repositories.

What if instead we say "reproducible practices" or "standardized practice" and use a more descriptive word than "best" ?

Two Specifications

This part

spec/index.html

Lines 122 to 125 in 9a2fd7b

<li>Structure. A normative specification of the nature of an OCFL Object (the "object-at-rest");</li>
<li>Client Behaviours. A set of recommendations for how OCFL Objects should be acted upon (the
"object-in-motion")
</li>

I think there are two (very big and different) things here that would eventually need to work together. The first part to tackle, I think, is just the first point, doing so with the second in mind. The second is the language to interact with it. This again smells a lot like scif --> https://sci-f.github.io/spec-v1, the main difference being that scif has user interface commands that "feel" a lot like how you would interact with a container (e.g. "run" "exec" "shell") but, notably, it doesn't have to be installed in a container.

I think it would be stronger to package these two things somewhat separately, because even if you branded them under the same label, I can see use cases where someone might want one or the other (but not both).

spec/index.html

Lines 129 to 131 in 9a2fd7b

<p>The OCFL initiative arose from a need to have well-defined application-independent file management within
digital repositories.</p>
<p>A general observation is that the contents of a digital repository -- that is, the digital files and metadata

Digital repositories such as what? A file system in an old library is different from object storage, which is different from a flat database, which is different from a relational one.

spec/index.html

Lines 131 to 135 in 9a2fd7b

<p>A general observation is that the contents of a digital repository -- that is, the digital files and metadata
that an institution might wish to manage -- are largely stable. Once content has been accessioned, it is
unlikely to change significantly over its lifetime. This is in contrast to the software applications that
manage these contents, which are ephemeral, requiring constant updating and replacement. Thus, transitions
between application-specific methods of file management to support software upgrades and replacement cycles

This is a really important point - I don't think that institutions even know what these metadata are; the ones that are aware of it are already farther along. This is probably more challenging than making the standard itself - going across software / archives in general for data or other and deciding what are the "important metadata." There needs to be a whole workflow to create groups, assign responsibilities, and go through change cycles just for doing that. This is another reason I would set aside the set of functions for now - the needs and functions for interacting with the data structure are probably going to vary based on the domain, and also might be better carried out by third party software that writes tools understanding the data structure.

spec/index.html

Lines 159 to 162 in 9a2fd7b

An OCFL Object is a group of one or more content bitstreams (data and metadata), and their administrative
information that are together identified by a URI. The object may contain a sequence of versions of the
bitstreams that represent the evolution of the object's contents.
</p>

What do these permissions look like? POSIX? Something else? How does this vary based on traditional NFS vs an object store or a signed container?

spec/index.html

Lines 171 to 176 in 9a2fd7b

[object_root]
├── 0=ocfl_object_1.0
├── inventory.jsonld
├── inventory.jsonld.sha512
├── logs
│   └── .keep

So this is scoped to a filesystem then? A filesystem in a container?

Another gut reaction is that this sounds a lot like what people would describe as a "data container" but just with a required organization inside. Again a little bit like scif, since it has the /scif root and a predictable structure for the contents within :)

Are there any example use cases? That might be useful.

So I would say this is really great so far! Maybe for this first draft focus on just the organization, and a plan for how the community works around it? Then either solutions will start to emerge (or a need for a general set of functions) for interacting with the data structures.

Relationship to BagIt bags

The overlap and similarity between OCFL and BagIt has been noted several times.
https://tools.ietf.org/html/draft-kunze-bagit-08

BagIt bags are fundamentally composed of a directory that contains:

  • bagit version file
  • manifest of data files
  • "data" directory of arbitrary content files

The primary difference with OCFL is the administrative metadata detailing changes associated with a given version.

Would it make sense to adapt the OCFL definition of a "version" directory to look more like a BagIt bag with the addition of inventory.jsonld?

JSON-LD, JSON, or something else for versions document?

What syntax should the inventory/versions document have?

Discussion from draft document:

@zimeon: JSON-LD is great, but is in the middle of upheaval from v1.0 to v1.1, and v1.2 could easily come along in a couple more years. I have some concerns that it is too much of a moving target for an internal repository data format in systems that will have archival application.

@julianmorley: +1

@ahankinson: Yes, but ... :) I would probably prefer JSON-LD in its unstable forms, over arbitrary non-typed JSON.

Avoid confusion about use of the word "object"

We talk about "OCFL objects" (the whole layout including admin metadata and versions) but we also then sometimes talk about "implementers may have different local requirements to store audit information for their digital objects" or "versions of an object", which conflates the contents of the OCFL object with the object. I think we should use object only in the first case ("OCFL object") and in the second talk about digital information or content (of which there may be multiple versions).

Client behaviors document

Client behaviors are moved to a new document, and this should be stubbed out.

The outline is currently TBD.

Reword purpose of digest in section 3.4

Section 3.4 reads

Digests play two roles in an OCFL Object. The first is that digests allow for content-addressable storage; that is, for a file to be addressed by the digest of its contents, rather than its filename. The second is that digests provide for fixity checks to determine whether a file has become corrupt through hardware degradation or malicious actors.

It is no longer the case that we use digests for content-addressable storage; files are identified by their location/filename and digest.
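The fixity role that remains can be illustrated with a short sketch. This is not text from the specification: the function names `file_digest` and `check_fixity` are hypothetical, and SHA-512 is assumed as the default only because the draft layout shows an `inventory.jsonld.sha512` file.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha512") -> str:
    """Return the hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def check_fixity(path: Path, expected: str, algorithm: str = "sha512") -> bool:
    """Return True if the file's current digest matches the recorded digest."""
    return file_digest(path, algorithm) == expected.lower()
```

A repository would record the result of `file_digest` at ingest and call `check_fixity` later; a `False` result means the bytes on disk no longer match what was recorded.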

Add inventory file for each version of an object

From 2018-03-28 meeting notes there was a "decision" to add an inventory file to each version directory and some discussion of whether the top-level version information would be reconstructable from these inventories:

  • Moab vs OCFL
    • Moab never changes files; you only add files via new versions.
      • You have to look at more files to get to all the content in a particular version
      • You don’t duplicate storage
    • OCFL versions: the only metadata about the distinctions b/t versions is in versions.jsonld
    • Moab has manifests directory for each version; OCFL does not have that (and especially not the file diff info as a manifest)
    • Should OCFL have forward versioning manifest info for each version?
    • OCFL is sha256; currently baked in; meant for de-dup, not fixity checking
    • Reconstituting versions.jsonld is crucial to reconstituting the object
    • Decision? Add inventory.jsonld to the top level of each version subdirectory

...
Decisions
...

  • Add relevant inventory.jsonld to the top level of each version subdirectory
  • Also add [sha256].sha256 (sha of inventory.jsonld) to each version directory?
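As a rough sketch of what this decision could mean in practice, the snippet below writes an inventory into a version directory together with a digest sidecar. Several things here are assumptions for illustration only: the sidecar name `inventory.jsonld.sha256` (chosen by analogy with `inventory.jsonld.sha512` in the draft layout), the `digest  filename` line format, and the function name `write_inventory_with_sidecar`.

```python
import hashlib
import json
from pathlib import Path


def write_inventory_with_sidecar(version_dir: Path, inventory: dict) -> Path:
    """Write inventory.jsonld plus a sidecar file holding its SHA-256 digest."""
    version_dir.mkdir(parents=True, exist_ok=True)
    data = json.dumps(inventory, indent=2, sort_keys=True).encode("utf-8")
    (version_dir / "inventory.jsonld").write_bytes(data)
    # The digest is computed over the exact bytes written to disk.
    digest = hashlib.sha256(data).hexdigest()
    sidecar = version_dir / "inventory.jsonld.sha256"  # assumed name
    sidecar.write_text(f"{digest}  inventory.jsonld\n", encoding="utf-8")
    return sidecar
```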

Linking to External References

For #36 we note the performance implications of using SHA-256, which is based on this post:

https://medium.com/@davidtstrauss/stop-using-sha-256-6adbb55c608

We should add a link to that post within the body of the spec as a citation, but having looked at the ReSpec documentation I am unsure of the best way to do this. It seems they want you to add it to a big global database, but I would prefer not to do that if we can help it, and it seems largely geared towards referencing other specs.

Any pointers from experienced ReSpec writers?

Version directory names -- leading zeros (and if so how many), or not?

From draft document which proposed 4 digits v0001 etc.:

@zimeon: What happens with 10,000th version? Need to define whether that is illegal (and debate number of digits) or there is extension to v10000 to deal with possible edge case (as in e.g. https://github.com/ndlib/bendo/blame/master/architecture/bundle.md#L16-L18 )

@awoods: +1

@julianmorley: It's a concern, but in the many years that Stanford has been using this format, the average version # in our repo (out of 1.5MM objects) is 2.5, and the highest version number is 20.

@neilsjefferies: Why not just drop the leading zeros, the maximum version is thus limited by filesystem naming.

@zimeon: I agree that dropping the leading zeros is probably the cleanest solution -- the combination of manually looking at the directory structure AND having > 9 versions very often seems like a real edge case

@ahankinson: The risk with dropping the leading zeroes is ASCIIbetical sorting -- v1, v10, v100, v2, v20, v200, etc.. I agree that dropping the zeroes has significant advantages, so I'm ok with it, but it would be good to recognise the disadvantages as well.

@rosy1280: +1 for some type of ASCIIbetical sorting. what would be the advantage of having all those 0s rather than just a single 0. i would imagine there is some, but i can not think of one (or remember why it was done that way although i can remember being told why...)
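The sorting trade-off discussed in this thread can be seen in a few lines of Python. This is only an illustration; the helper name `version_number` is hypothetical, and the four-digit padding mirrors the `v0001` proposal from the draft.

```python
import re


def version_number(name: str) -> int:
    """Extract the integer from a version directory name such as 'v10' or 'v0010'."""
    match = re.fullmatch(r"v(\d+)", name)
    if match is None:
        raise ValueError(f"not a version directory name: {name!r}")
    return int(match.group(1))


names = ["v1", "v2", "v10", "v20", "v100", "v200"]

# Plain string sort, as a directory listing would typically show it.
lexical = sorted(names)
# lexical == ["v1", "v10", "v100", "v2", "v20", "v200"]

# Sorting on the parsed integer restores version order without padding.
numeric = sorted(names, key=version_number)
# numeric == ["v1", "v2", "v10", "v20", "v100", "v200"]

# Zero-padded names sort correctly as plain strings, up to the padding width.
zero_padded = sorted(f"v{n:04d}" for n in (1, 2, 10, 20, 100, 200))
# zero_padded == ["v0001", "v0002", "v0010", "v0020", "v0100", "v0200"]
```

Software can always sort numerically, so the padding mainly helps people reading a raw directory listing.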

How can I help?

hey @zimeon, @neilsjefferies, @awoods, @ahankinson, @rosy1280. I'm in research
computing at Stanford, and was having a discussion about how the problem of
reproducibility (or encapsulation of a workflow) might be differently (maybe better?)
addressed if we spent more time looking at the organization of the data / inputs,
as opposed to "which workflow manager to use." Software (containers) could be designed
around these rules, and instead of a complicated set of inputs and outputs, the
software would understand a dataset format. Datasets would be able to find
software that works with them, and vice versa. You could also do things like move more
easily between object storage, local filesystem, and traditional databases, because
given that you can predict how files are organized, you can programmatically change them
and (still) have programmatic accessibility. As an example, the Brain Imaging Data
Structure (BIDS) has done great things for neuroimaging informatics, for exactly these reasons.

Anyway, I was having this discussion with my colleague @akkornel, and he is aptly
skilled in poking around for more information, and he found https://ocfl.io/! And I'm super happy that you exist!! This is so important. I want to ask (see title of this post) how can I help?

Rapid work and merge policy through to 2018-07-07 in person meeting

Discussed on the 2018-07-18 editors' call: we propose to adopt the following policy in order to get more content up quickly before the in-person meeting:

  • UNTIL 2018-07-07 - A pull request should be merged once a majority of editors have responded with 👍/+1, provided no editors have responded with 👎/-1, without any time delay requirement.

will add as extra bullet on https://github.com/OCFL/spec/wiki/Change-Suggestion-and-Resolution-Process

Requirements for inventory.jsonld

  1. Forward-delta versioning
  2. Manifest of each version's contents
  3. Fixity data for de-duplication
  4. Use of original filenames
  5. Versions in separate data buckets
  6. Support for file renaming
  7. Support for duplicate files
  8. Support for multiple locations (local and remote) within a version? - in scope?
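To make several of these requirements concrete (forward-delta storage, a per-version manifest, de-duplication by digest, original filenames, renames and duplicates), here is a toy in-memory model. It is a sketch under stated assumptions and not the OCFL inventory format: the keys `manifest` and `versions`, the `vN/filename` stored paths, the use of SHA-256, and the function name `add_version` are all made up for illustration.

```python
import hashlib


def add_version(inventory: dict, files: dict) -> str:
    """Append a version to a toy inventory and return its name.

    `files` maps original filename -> bytes for the complete new version state.
    Content whose digest is already in the manifest is not stored again; the
    new version just references the path where it was first stored.
    """
    manifest = inventory.setdefault("manifest", {})  # digest -> stored path
    versions = inventory.setdefault("versions", {})  # version -> {filename: digest}
    version = f"v{len(versions) + 1}"
    state = {}
    for name, content in sorted(files.items()):
        digest = hashlib.sha256(content).hexdigest()
        # Only new content gets a stored path, inside this version's own bucket.
        manifest.setdefault(digest, f"{version}/{name}")
        state[name] = digest
    versions[version] = state
    return version


inv = {}
add_version(inv, {"a.txt": b"alpha", "b.txt": b"beta"})
# v2 renames a.txt and adds a duplicate of b.txt; no new content is stored.
add_version(inv, {"renamed.txt": b"alpha", "b.txt": b"beta", "copy.txt": b"beta"})
```

After both calls the manifest still holds two entries, both pointing into `v1`, while each version keeps a full listing of its own filenames.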

Add also to sentence in Section 3.4.2

Section 3.4.2 currently reads:

Implementers may wish to store their file digests in a system external to their OCFL object stores at the point of ingest, to further safeguard against the possibility of malicious manipulation of file contents and digests.

It should read:

Implementers may also wish to store their file...

Decouple storage from OCFL Object

Emory: I can't necessarily keep all the files on local disk because it is too expensive. So I'll need OCFL to be able to identify what storage root the files are in.

Stanford: I'm going to move zipped (not compressed) versions to various S3 buckets and I need to track where those zipped versions went.

See also: OCFL/Use-Cases#10

Action: Add clarification of handling of empty directories to spec

  1. Need to figure out where it goes in the text
  2. Determine how we handle it. Follow the BagIt spec:

Payload manifests only include the pathnames of files. Because of
this, a payload manifest cannot reference empty directories. To
account for an empty directory, a bag creator may wish to include at
least one file in that directory; it suffices, for example, to
include a zero-length file named ".keep".
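If the BagIt approach were adopted, a client could apply it mechanically. The following is a minimal sketch; the function name `add_keep_files` is hypothetical and the placeholder name `.keep` is taken from the BagIt text quoted above.

```python
import os
from pathlib import Path


def add_keep_files(root: Path, name: str = ".keep") -> list:
    """Create a zero-length placeholder file in every empty directory under root."""
    created = []
    for dirpath, dirnames, filenames in os.walk(root):
        # A directory is empty when it has neither subdirectories nor files.
        if not dirnames and not filenames:
            keep = Path(dirpath) / name
            keep.touch()
            created.append(keep)
    return created
```

Because each placeholder is an ordinary file, it can be listed in a manifest like any other content file.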

Should version information contain a digest for the previous version?

From 2018-03-28 meeting notes there was discussion about the possibility of:

JSON-LD file for each version could contain the checksum of the JSON-LD file of the previous version

  • This eliminates the additional Namaste file in each object version
  • Each version is validated entirely by the next
  • The latest version is validated by the top level SHA (since its JSON-LD is the same as the top level one)
  • This does not need to be computed when a new version is created, it is just a copy of the top level one
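The chaining idea in these notes can be sketched with plain dictionaries standing in for the per-version JSON-LD files. Everything named here is an assumption for illustration: the field `previous_inventory_digest`, the canonical JSON serialisation, SHA-256, and the function names.

```python
import hashlib
import json


def inventory_digest(inventory: dict) -> str:
    """SHA-256 over a canonical (sorted-key, compact) JSON serialisation."""
    data = json.dumps(inventory, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def append_version(chain: list, content: dict) -> dict:
    """Append an inventory that records the digest of its predecessor."""
    previous = inventory_digest(chain[-1]) if chain else None
    inventory = {"content": content, "previous_inventory_digest": previous}
    chain.append(inventory)
    return inventory


def verify_chain(chain: list, top_level_digest: str) -> bool:
    """Validate each inventory by its successor, and the last by the top-level digest."""
    for earlier, later in zip(chain, chain[1:]):
        if later["previous_inventory_digest"] != inventory_digest(earlier):
            return False
    return bool(chain) and inventory_digest(chain[-1]) == top_level_digest
```

With this arrangement, altering any earlier inventory breaks the link recorded in the next one, and only the newest inventory depends on the separately stored top-level digest.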

Editorial policy - what is agreement?

We need to decide how we agree to move forward and decide an issue is resolved, and how we agree that a change should be merged and does indeed reflect the decision. The best experiences I've had with such processes are in IIIF and the Fedora API (both a similar size editorial group, 5 instead of 6).

Denoting non-normative text

Suggestion:

  1. Non-normative sections should be denoted with the use of:
<section id="whatever" class="informative">
  2. Non-normative blocks within a section should be denoted with the use of:
<blockquote class="informative">
  <p>
    Non-normative note:
    ...text...
  </p>
</blockquote>

Should OCFL be defined in terms of files or bitstreams?

The current definition of an OCFL object says:

An OCFL Object is a group of one or more content bitstreams (data and metadata), and their administrative information that are together identified by a URI. The object may contain a sequence of versions of the bitstreams that represent the evolution of the object's contents.

If the name of the initiative/spec includes file in the title, should we include file in the definition of an OCFL object? If so, how do we indicate that this work can be generalized to work over object stores too?

Object specification definition - revision suggestion

We discussed the definition for Objects in today's community call. It was suggested to move this to GH as a next step. I'm not able to branch and pr on the repo atm so I will post this as an issue here instead. The following captures the changes I had suggested to put the emphasis back on the concept of "files" in the primary specification definition. I'll leave it to the spec editors to do with as you please:

  1. Object Specification
    "An OCFL Object is a group of one or more content and metadata files (e.g. file1.jpg, file2.txt, file3.ead.xml, file4.dc.json). This object is identified by an URI and includes OCFL administrative information. This administrative information may contain a sequence of versions of the files that document the evolution of the object's contents. The files in the object follow a layout as prescribed by the OCFL specification."
