ocfl / spec

The Oxford Common File Layout (OCFL) specifications

Home Page: https://ocfl.io

HTML 81.82% CSS 14.08% JavaScript 3.69% Python 0.42%

ocfl digital-preservation

spec's Introduction

The Oxford Common File Layout

Status of this document

This version: 1.0

Latest stable version: 1.0

Editors

Andrew Hankinson, Bodleian Libraries, University of Oxford
Neil Jefferies, Bodleian Libraries, University of Oxford
Rosalyn Metz, Emory University
Julian Morley, Stanford University
Simeon Warner, Cornell University
Andrew Woods, DuraSpace

Latest editors draft

https://ocfl.io/

Communication Channels

Mailing Lists

[email protected]: Broader communications, announcements, meeting notifications
[email protected]: OCFL community discussions, announcements, support, software implementer discussions, spec interpretation, technical discussions

Slack Channel

There is an #ocfl Slack channel that you can join as part of the code4lib Slack channel.

GitHub

We use GitHub to track use cases and discussions around specific issues of the specifications. These discussions can be found at https://github.com/ocfl/Use-Cases/issues.

This document is licensed under a Creative Commons Attribution 4.0 License.

spec's Issues

Digest vs checksum vs hash

In the current draft we talk about digests, checksums and hashes... I think it is unhelpful to use three terms when we mean the same thing. I propose that we always say "digest" for the output and that we never say "checksum" (unless we somewhere wanted to say "digest, sometimes called a checksum"). We might also use "hash" to talk about the algorithms and the calculation, but I would again stick to "digest" for the output.

Change OCFL Collection to OCFL Storage Root in spec

Set up CI to validate gitbook source?

I'm not sure how to do this but it would be useful to have CI running on PRs so that we know they don't break the spec. We use travis on the IIIF specs for example, I'm sure it would also be possible with circle-ci though I'm not familiar with that.

start of Feedback on Draft

The reason this is so important is that containers either have a single entrypoint, or a programmatically predictable (internally modular) set of entrypoints (e.g., see https://sci-f.github.io).

spec/index.html

Lines 113 to 120 in 9a2fd7b

 <p>
 This Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage
 of digital information in a structured, transparent, and predictable manner. It is designed to promote long-term
 object management best practices within digital repositories.
 </p>
 <p>
 This specification covers two principle areas:
 </p>

Right now the Scientific Filesystem (scif above) has internal modularity for the software inside, and it has a spot for data, but not any specification to how it's organized. This would be a really strong use case for how the standard can work with containers.

Best Practices

We might want to be conservative about saying "best practices." The project is developing a standard, so it is (by definition) something like that, but many will hear "best practices" and immediately find a reason why it's not the absolute best.

spec/index.html

Line 116 in 9a2fd7b

object management best practices within digital repositories.

What if instead we say "reproducible practices" or "standardized practices", using a more descriptive word than "best"?

Two Specifications

This part

spec/index.html

Lines 122 to 125 in 9a2fd7b

 <li>Structure. A normative specification of the nature of an OCFL Object (the "object-at-rest");</li>
 <li>Client Behaviours. A set of recommendations for how OCFL Objects should be acted upon (the
 "object-in-motion")
 </li>

I think there are two (very big and different) things here that would eventually need to work together. The part to tackle first, I think, is just the first point, and to do it with the second in mind. The second is the language to interact with it. This again smells a lot like scif --> https://sci-f.github.io/spec-v1, the main difference being that scif has user interface commands that "feel" a lot like how you would interact with a container (e.g. "run" "exec" "shell") but, notably, it doesn't have to be installed in a container.

I think it would be stronger to package these two things somewhat separately, because even if you branded them under the same label, I can see use cases where someone might want one or the other (but not both).

spec/index.html

Lines 129 to 131 in 9a2fd7b

 <p>The OCFL initiative arose from a need to have well-defined application-independent file management within
 digital repositories.</p>
 <p>A general observation is that the contents of a digital repository -- that is, the digital files and metadata

Digital repositories such as? A file system in an old library is different from object storage, which is different from a flat database, which is different from a relational one.

spec/index.html

Lines 131 to 135 in 9a2fd7b

 <p>A general observation is that the contents of a digital repository -- that is, the digital files and metadata
 that an institution might wish to manage -- are largely stable. Once content has been accessioned, it is
 unlikely to change significantly over its lifetime. This is in contrast to the software applications that
 manage these contents, which are ephemeral, requiring constant updating and replacement. Thus, transitions
 between application-specific methods of file management to support software upgrades and replacement cycles

This is a really important point - I don't think that institutions even know what these metadata are, and the ones that are aware of them are already farther along. This is probably more challenging than making the standard itself - going across software / archives in general, for data or otherwise, and deciding what the "important metadata" are. There needs to be a whole workflow to create groups, assign responsibilities, and go through change cycles just for doing that. This is another reason I would set aside the set of functions for now - the needs and functions for interacting with the data structure are probably going to vary based on the domain, and might also be better carried out by third-party software whose tools understand the data structure.

spec/index.html

Lines 159 to 162 in 9a2fd7b

 An OCFL Object is a group of one or more content bitstreams (data and metadata), and their administrative
 information that are together identified by a URI. The object may contain a sequence of versions of the
 bitstreams that represent the evolution of the object's contents.
 </p>

What do these permissions look like? POSIX? Something else? How does this vary based on traditional NFS vs an object store or a signed container?

spec/index.html

Lines 171 to 176 in 9a2fd7b

 [object_root]
 ├── 0=ocfl_object_1.0
 ├── inventory.jsonld
 ├── inventory.jsonld.sha512
 ├── logs
 │   └── .keep

So this is scoped to a filesystem then? A filesystem in a container?

Another gut reaction is that this sounds a lot like what people would describe as a "data container" but just with a required organization inside. Again a little bit like scif, since it has the /scif root and a predictable structure for the contents within :)

Are there any example use cases? That might be useful.

So I would say this is really great so far! Maybe for this first draft focus on just the organization, and a plan for how the community works around it? Then either solutions will start to emerge (or a need for a general set of functions) for interacting with the data structures.

Relationship to BagIt bags

The overlap and similarity between OCFL and BagIt has been noted several times.
https://tools.ietf.org/html/draft-kunze-bagit-08

BagIt bags are fundamentally composed of a directory that contains:

bagit version file
manifest of data files
"data" directory of arbitrary content files

The primary difference with OCFL is the administrative metadata detailing changes associated with a given version.

Would it make sense to adapt the OCFL definition of a "version" directory to look more like a BagIt bag with the addition of inventory.jsonld?

JSON-LD, JSON, or something else for versions document?

What syntax should the inventory/versions document have?

Discussion from draft document:

@zimeon: JSON-LD is great, but is in the middle of upheaval from v1.0 to v1.1. and v1.2 could easily come along in a couple more years. I have some concerns that it is too much of a moving target for an internal repository data format in systems that will have archival application

@julianmorley: +1

@ahankinson: Yes, but ... :) I would probably prefer JSON-LD in its unstable forms, over arbitrary non-typed JSON.

Action: Fix hash/digest/checksum in spec

After decision made in #46

Automate notices to OCFL Slack Channel for open PRs

Avoid confusion about use of the word "object"

We talk about "OCFL objects" (the whole layout including admin metadata and versions) but we also then sometimes talk about "implementers may have different local requirements to store audit information for their digital objects" or "versions of an object", which conflates the contents of the OCFL object with the object. I think we should use object only in the first case ("OCFL object") and in the second talk about digital information or content (of which there may be multiple versions).

Client behaviors document

Client behaviors are moved to a new document, and this should be stubbed out.

The outline is currently TBD.

Reword purpose of digest in section 3.4

Section 3.4 reads

Digests play two roles in an OCFL Object. The first is that digests allow for content-addressable storage; that is, for a file to be addressed by the digest of its contents, rather than its filename. The second is that digests provide for fixity checks to determine whether a file has become corrupt through hardware degradation or malicious actors.

It is no longer the case that we use digests for content-addressable storage; files are identified by their location/filename and digest.

Add inventory file for each version of an object

From 2018-03-28 meeting notes there was a "decision" to add an inventory file to each version directory and some discussion of whether the top-level version information would be reconstructable from these inventories:

Moab vs OCFL

Moab never changes files; you only add files via new versions.

You have to look at more files to get to all the content in a particular version

You don’t duplicate storage

OCFL versions: the only metadata about the distinctions b/t versions is in versions.jsonld

Moab has manifests directory for each version; OCFL does not have that (and especially not the file diff info as a manifest)

Should OCFL have forward versioning manifest info for each version?

OCFL is sha256; currently baked in; meant for de-dup, not fixity checking

Reconstituting versions.jsonld is crucial to reconstituting the object

Decision? Add inventory.jsonld to the top level of each version subdirectory

...
Decisions
...

Add relevant inventory.jsonld to the top level of each version subdirectory

Also add [sha256].sha256 (sha of inventory.jsonld) to each version directory?

Support File Renaming in inventory.jsonld

Currently the inventory.jsonld file does not support file renaming. We need to resolve this.

Relates to OCFL/Use-Cases#26

Add normative references that define the digest algorithms and the specific encodings

PR #36 adds a table of digest algorithms "known" to OCFL. We need to link these to definitions of the algorithms and the encodings that are assumed in OCFL

Develop JSON-LD context specification for OCFL

As per #1 (comment), point 3.

Write up use of JSON-LD for inventory

Follow decision in #1

Make sure we are clear about the file "name" in an OCFL object being the full path within a version

We must avoid any possible confusion with the filename being just the last component of the full path. So the name would be a full path like a/b/c/d.txt as opposed to just d.txt (which could be confused with e/d.txt).
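
The ambiguity described above can be shown with a small Python sketch. The two paths are the ones from the issue text; nothing here is taken from the spec itself.

```python
from pathlib import PurePosixPath

# The two example paths from the issue text.
paths = ["a/b/c/d.txt", "e/d.txt"]

# The last path component alone is ambiguous: both files are "d.txt".
basenames = [PurePosixPath(p).name for p in paths]
print(basenames)

# Counted by full path the two files stay distinct; by basename they collapse.
print(len(set(basenames)), len(set(paths)))
```

Treating the full path within a version as the file "name" keeps such files distinguishable.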

Linking to External References

For #36 we note the performance implications of using SHA-256, which is based on this post:

https://medium.com/@davidtstrauss/stop-using-sha-256-6adbb55c608

We should add a link to that post within the body of the spec as a citation, but having looked at the ReSpec documentation I am unsure of the best way to do this. It seems they want you to add it to a big global database, but I would prefer not to do that if we can help it, and it seems largely geared towards referencing other specs.

Any pointers from experienced ReSpec writers?

Version directory names -- leading zeros (and if so how many), or not?

From draft document which proposed 4 digits v0001 etc.:

@zimeon: What happens with 10,000th version? Need to define whether that is illegal (and debate number of digits) or there is extension to v10000 to deal with possible edge case (as in e.g. https://github.com/ndlib/bendo/blame/master/architecture/bundle.md#L16-L18 )

@awoods: +1

@julianmorley: It's a concern, but in the many years that Stanford has been using this format, the average version # in our repo (out of 1.5MM objects) is 2.5, and the highest version number is 20.

@neilsjefferies: Why not just drop the leading zeros, the maximum version is thus limited by filesystem naming.

@zimeon: I agree that dropping the leading zeros is probably the cleanest solution -- the combination of manually looking at the directory structure AND having > 9 versions very often seems like a real edge case

@ahankinson: The risk with dropping the leading zeroes is ASCIIbetical sorting -- v1, v10, v100, v2, v20, v200, etc.. I agree that dropping the zeroes has significant advantages, so I'm ok with it, but it would be good to recognise the disadvantages as well.

@rosy1280: +1 for some type of ASCIIbetical sorting. what would be the advantage of having all those 0s rather than just a single 0. i would imagine there is some, but i can not think of one (or remember why it was done that way although i can remember being told why...)
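
The sorting concern raised in this thread can be illustrated with a minimal Python sketch. The `version_number` helper and the sample directory names are hypothetical; the thread had not settled on a naming scheme.

```python
def version_number(dirname: str) -> int:
    """Parse the integer version from a name such as 'v10' or 'v0010'."""
    if not dirname.startswith("v") or not dirname[1:].isdigit():
        raise ValueError(f"not a version directory name: {dirname!r}")
    return int(dirname[1:])


# Hypothetical version directories without leading zeros.
unpadded = ["v1", "v2", "v10", "v20", "v100", "v200"]

# Plain string sorting compares character by character ("ASCIIbetical").
print(sorted(unpadded))

# Sorting on the parsed integer restores numeric order.
print(sorted(unpadded, key=version_number))
```

A client can therefore recover numeric order with or without padding; the padding mainly affects what a person sees in a plain directory listing.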

Every inventory file MUST have an accompanying digest file

In discussion https://github.com/OCFL/spec/wiki/2018.07.11-Editors-Meeting we have agreed that each inventory.jsonld file MUST have an accompanying digest file inventory.jsonld.DIGEST where DIGEST may be sha512 or such, whether this be at the top level or inside a version.
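
As a rough sketch of what writing and checking such a digest file could look like in Python: the function names are hypothetical, and the sidecar content layout (hex digest, a space, then the file name) is an assumption, since the issue text does not define it.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha512") -> str:
    """Return the hex digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def sidecar_path(inventory: Path, algorithm: str = "sha512") -> Path:
    """inventory.jsonld -> inventory.jsonld.sha512 (same directory)."""
    return inventory.with_name(f"{inventory.name}.{algorithm}")


def write_sidecar(inventory: Path, algorithm: str = "sha512") -> Path:
    """Write the digest file next to the inventory (assumed layout)."""
    sidecar = sidecar_path(inventory, algorithm)
    sidecar.write_text(f"{file_digest(inventory, algorithm)} {inventory.name}\n")
    return sidecar


def verify_sidecar(inventory: Path, algorithm: str = "sha512") -> bool:
    """Compare the recorded digest with a freshly computed one."""
    recorded = sidecar_path(inventory, algorithm).read_text().split()[0]
    return recorded == file_digest(inventory, algorithm)
```

The same two functions would apply to an inventory at the top level or inside a version directory, since the sidecar is always located relative to the inventory file.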

How can I help?

hey @zimeon, @neilsjefferies, @awoods, @ahankinson, @rosy1280. I'm in research
computing at Stanford, and was having a discussion about how the problem of
reproducibility (or encapsulation of a workflow) might be differently (maybe better?)
addressed if we spent more time looking at the organization of the data / inputs,
as opposed to "which workflow manager to use." Software (containers) could be designed
around these rules, and instead of a complicated set of inputs and outputs, the
software would understand a dataset format. Datasets would be able to find
software that work with them, and vice versa. You could also do things like move more
easily between object storage, local filesystem, and traditional databases, because
given that you can predict how files are organized, you can programmatically change them
and (still) have programmatic accessibility. As an example, the Brain Imaging Data
Structure (BIDS) has done great things for neuroimaging informatics, for exactly these reasons.

Anyway, I was having this discussion with my colleague @akkornel, and he is aptly
skilled in poking around for more information, and he found https://ocfl.io/! And I'm super happy that you exist!! This is so important. I want to ask (see title of this post) how can I help?

Rapid work and merge policy through to 2018-07-07 in person meeting

Discussed on 2018-07-18 editors' call, we propose to adopt the following policy in order to get more content up quickly before the in-person meeting:

UNTIL 2018-07-07 - A pull request should be merged once a majority of editors have responded with 👍/+1, provided no editors have responded with 👎/-1, without any time delay requirement.

will add as extra bullet on https://github.com/OCFL/spec/wiki/Change-Suggestion-and-Resolution-Process

Requirements for inventory.jsonld

Forward-delta versioning
Manifest of each version's contents
Fixity data for de-duplication
Use of original filenames
Versions in separate data buckets
Support for file renaming
Support for duplicate files
Support for multiple locations (local and remote) within a version? - in scope?

Establish top-level outline of specification

Before diving too far into the details of the specification, agreement on the top-level outline would be helpful.

Versioning of the OCFL specification

We need to decide how we will version the OCFL specification and how this will be reflected in the URIs for the spec. At present we have https://ocfl.io/ and https://ocfl.io/context.jsonld and we will need https://ocfl.io/SOMETHING/ and https://ocfl.io/SOMETHING/context.jsonld , where SOMETHING is to be defined.

Should we use UK or US spelling?

Brought up in discussion of #44... we need to pick one and be consistent

Add also to sentence in Section 3.4.2

Section 3.4.2 currently reads:

Implementers may wish to store their file digests in a system external to their OCFL object stores at the point of ingest, to further safeguard against the possibility of malicious manipulation of file contents and digests.

It should read:

Implementers may also wish to store their file...

Move gist contents to the spec for commenting, etc.

n/t

What is an OCFL digital object?

We need a good definition for the glossary (https://github.com/OCFL/spec/blame/master/GLOSSARY.md#L2 ).

Use SHA512 as digest algorithm

How do we decide what digest is best as the built-in one for deduping? I note an article https://medium.com/@davidtstrauss/stop-using-sha-256-6adbb55c608 that argues that sha512 is quicker to compute than sha256 and that, even if you want to avoid the full length of sha512, sha512/256 (a 256 bit truncation of sha512) is therefore better than sha256.

Decouple storage from OCFL Object

Emory: I can't necessarily keep all the files on local disk because it is too expensive. So I'll need OCFL to be able to identify what storage root the files are in.

Stanford: I'm going to move zipped (not compressed) versions to various S3 buckets and I need to track where those zipped versions went.

Add CI and basic validation to ReSpec

This task is to add a .travis.yml with the following restrictions on the ReSpec document:

html5validator
line length not to exceed 120 characters
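
The line-length restriction could be checked with a few lines of Python run from the CI job. This is only a sketch: the `overlong_lines` helper is hypothetical, and how it would be wired into `.travis.yml` is left open.

```python
from typing import List, Tuple


def overlong_lines(text: str, limit: int = 120) -> List[Tuple[int, int]]:
    """Return (line_number, length) for every line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python check_lines.py index.html
    with open(sys.argv[1], encoding="utf-8") as f:
        problems = overlong_lines(f.read())
    for number, length in problems:
        print(f"line {number}: {length} characters")
    sys.exit(1 if problems else 0)
```

A non-zero exit status is what makes a CI run fail, so the script exits with 1 whenever an overlong line is found.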

Add reference to Namaste spec for conformance declaration

Mentioned in #32 (review)

Use of RFC2119 language

What words from rfc2119 will we use? Also need to add note and normative ref to spec.

Action: Fix spelling in spec to be US Spelling

Follows on decision made in #52

Auto-deploy gh-pages on commit

Currently, the hosted book (gh-pages) and master get out of sync unless the build.sh script is run.
This issue is to instead auto-deploy to the gh-pages branch on every commit to master.
see: https://gist.github.com/domenic/ec8b0fc8ab45f39403dd

Action: Add clarification of handling of empty directories to spec

Need to figure out where it goes in the text
Determine how we handle it. Follow the BagIt spec:

Payload manifests only include the pathnames of files. Because of
this, a payload manifest cannot reference empty directories. To
account for an empty directory, a bag creator may wish to include at
least one file in that directory; it suffices, for example, to
include a zero-length file named ".keep".
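
If the BagIt approach were adopted, a client could add the placeholder files automatically. The following Python sketch assumes that approach; the `add_keep_files` helper is hypothetical and not part of any OCFL or BagIt tooling.

```python
from pathlib import Path
from typing import List


def add_keep_files(root: Path, marker: str = ".keep") -> List[Path]:
    """Create a zero-length marker file in every empty directory under root."""
    created = []
    # Collect all directories first, then touch markers in sorted order.
    directories = sorted(p for p in root.rglob("*") if p.is_dir())
    for directory in [root] + directories:
        if not any(directory.iterdir()):
            keep = directory / marker
            keep.touch()
            created.append(keep)
    return created
```

Because a manifest lists only files, the marker file is what lets the otherwise empty directory appear in it.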

Support two files with the same content in inventory.jsonld

e.g., two files with different names but with zero content.
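
One way to picture this requirement is a manifest that maps each digest to a list of paths rather than to a single path. The Python sketch below uses a hypothetical in-memory structure for illustration; it is not the inventory.jsonld format.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def build_manifest(files: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Map each content digest to every path that has that content."""
    manifest = defaultdict(list)
    for path in sorted(files):
        digest = hashlib.sha512(files[path]).hexdigest()
        manifest[digest].append(path)
    return dict(manifest)


# Two differently named zero-length files plus one file with content.
example = {
    "a/empty.txt": b"",
    "b/also-empty.txt": b"",
    "c/data.txt": b"x",
}
manifest = build_manifest(example)
print(len(manifest))
```

With a list per digest, both zero-length files are recorded even though their content, and so their digest, is identical.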

Should version information contain a digest for the previous version?

From 2018-03-28 meeting notes there was discussion about the possibility of:

JSON-LD file for each version could contain the checksum of the JSON-LD file of the previous version

This eliminates the additional Namaste file in each object version

Each version is validated entirely by the next

The latest version is validated by the top level SHA (since its JSON-LD is the same as the top level one)

This does not need to be computed when a new version is created, it is just a copy of the top level one
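
The chaining idea from these notes can be sketched in Python as follows. The `previousInventoryDigest` field name and the document shape are hypothetical, chosen only to illustrate the mechanism, and sha512 is assumed as the algorithm.

```python
import hashlib
import json
from typing import List, Optional


def digest(data: bytes) -> str:
    return hashlib.sha512(data).hexdigest()


def make_inventory(version: str, files: List[str], previous: Optional[bytes]) -> bytes:
    """Serialize a version document that records the previous one's digest."""
    doc = {
        "version": version,
        "files": files,
        # Hypothetical field name; None for the first version.
        "previousInventoryDigest": digest(previous) if previous is not None else None,
    }
    return json.dumps(doc, sort_keys=True).encode("utf-8")


def verify_chain(inventories: List[bytes]) -> bool:
    """Check that each document records the digest of the one before it."""
    for previous, current in zip(inventories, inventories[1:]):
        if json.loads(current)["previousInventoryDigest"] != digest(previous):
            return False
    return True
```

In this arrangement the last document in the chain is not covered by any successor, which is why the notes rely on the top-level digest to validate the latest version.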

Clarify that all information about object must be serialized in the object

From 2018-03-28 meeting notes there was a "decision" about the need to:

Clarification: All the information about the object should be serialized in the object (metadata as well as …) as a consequence of the rebuildability requirement

Editorial policy - what is agreement?

We need to decide how we agree to move forward and decide an issue is resolved, and how we agree that a change should be merged and does indeed reflect the decision. The best experiences I've had with such processes are in IIIF and the Fedora API (both a similar size editorial group, 5 instead of 6).

IIIF Editorial Process - discuss in community, hope to reach community consensus, editorial discretion as to when that has occurred. Within editors' group: all but one editors +1, no -1 for normative change (see: Acceptance Criteria for Merging Changes)
Fedora API Issue Resolution Process - all editors +1 or majority editors +1, no -1 after 72 hours.

Denoting non-normative text

Suggestion:

Non-normative sections should be denoted with the use of:

<section id="whatever" class="informative">

Non-normative blocks within a section should be denoted with the use of:

<blockquote class="informative">
  <p>
    Non-normative note:
    ...text...
  </p>
</blockquote>

Editorial: link to editors' draft, institutions, list formatting

I notice a few things in the current draft:

the editors' draft links to https://ocfl.github.io/spec/ but that redirects to https://ocfl.io
missing affiliation links
an extra "2." in the numbered list in intro

Allow for user-determined checksum algorithms

Currently, SHA-512 is used only in a minority of preservation settings as the fixity checking algorithm (see the 2017 NDSA Fixity Survey Report, pp. 22-23). Requiring SHA-512 may be a cumbersome barrier for those who want to adopt OCFL.

Should OCFL be defined in terms of files or bitstreams?

The current definition of an OCFL object says:

An OCFL Object is a group of one or more content bitstreams (data and metadata), and their administrative information that are together identified by a URI. The object may contain a sequence of versions of the bitstreams that represent the evolution of the object's contents.

If the name of the initiative/spec includes file in the title, should we include file in the definition of an OCFL object? If so, how do we indicate that this work can be generalized to work over object stores too?

Is a `.keep` file required for an empty `logs` directory?

The current https://ocfl.io/#basic-structure shows a .keep file in the logs directory. I don't think this should be necessary, as we define the logs directory as a basic part of the OCFL Object structure.

Section 2: Populate Terms

This task is to add the terms (if not their definitions) to section 2: Terminology.

Object specification definition - revision suggestion

We discussed the definition for Objects in today's community call. It was suggested to move this to GH as a next step. I'm not able to branch and pr on the repo atm so I will post this as an issue here instead. The following captures the changes I had suggested to put the emphasis back on the concept of "files" in the primary specification definition. I'll leave it to the spec editors to do with as you please:

Object Specification
"An OCFL Object is a group of one or more content and metadata files (e.g. file1.jpg, file2.txt, file3.ead.xml, file4.dc.json). This object is identified by a URI and includes OCFL administrative information. This administrative information may contain a sequence of versions of the files that document the evolution of the object's contents. The files in the object follow a layout as prescribed by the OCFL specification."

Use ReSpec for writing the documentation

n/t

Clarify that the digests are for de-duping

From 2018-03-28 meeting notes there was a "decision" about the need to:

Be super clear that checksums are for de-duping, not for fixity checking (e.g. guarding against malicious acts)
