use-cases's Issues

Support multiple services logging to an object

An institution wishes to implement a comprehensive automated digital preservation auditing system. This system might be composed of several services that provide fixity checks, format validation, and sanity checks on the filesystem layout. These auditing services will maintain a log within the object itself. A separate log for the entire filesystem may also be maintained.

OCFL as a formal standard?

OCFL is an informal standard. It is openly available, openly licensed (CC BY 4.0), and archived in Zenodo. Are there implementation scenarios where OCFL being a formal standard would be important? And, if so, what level of formality is required?

Format shifts

An institution has decided that it will undertake a mass conversion of its image files from TIFF to JPEG 2000. They implement an object migration client that reads in the old files, converts them, creates a new version of the object, and stores the new files. Their client also checks that the conversion is lossless and validates that the JPEG 2000 is correctly encoded using Jpylyzer. A record of the format shift and the validation results is stored in the object's audit logs.

Support segmented file storage

Few filesystems and object stores work well with very large files (e.g. multi-terabyte) and the usual approach is to segment very large files into chunks for easier storage, transfer and fixity checking. Although one can store a set of segments in an OCFL v1 object, there is no support for understanding that a set of segments combine to make one logical file.
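
As a rough sketch of how a client might handle this today (the segment naming, sidecar file, and 1 GiB segment size below are assumptions, not anything defined by OCFL v1), a large file could be split into numbered parts whose digests, together with the digest of the logical whole, are recorded for later reassembly:

import hashlib
import json
from pathlib import Path

SEGMENT_SIZE = 1024 ** 3  # 1 GiB per segment; an arbitrary choice for this sketch


def segment_file(source: Path, dest_dir: Path) -> dict:
    """Split `source` into numbered parts and return a mapping that a
    hypothetical convention could use to reassemble the logical file."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    whole = hashlib.sha512()
    parts = []
    with source.open("rb") as f:
        index = 0
        while chunk := f.read(SEGMENT_SIZE):
            whole.update(chunk)
            part_name = f"{source.name}.part{index:04d}"
            (dest_dir / part_name).write_bytes(chunk)
            parts.append({"path": part_name, "sha512": hashlib.sha512(chunk).hexdigest()})
            index += 1
    return {
        "logicalPath": source.name,
        "logicalSha512": whole.hexdigest(),
        "segments": parts,
    }


if __name__ == "__main__":
    mapping = segment_file(Path("dataset.tar"), Path("segments"))
    Path("segments/segments.json").write_text(json.dumps(mapping, indent=2))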

Flagging file loss/corruption

A periodic audit of a filesystem has revealed that a PDF file no longer matches its checksum, and will no longer open in a PDF reader. Checksums should be used to flag this as a problem and alert a validation client that the object is no longer valid.
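
A minimal sketch of such a check, assuming the object's top-level inventory.json uses sha512 and the standard manifest layout (the object path is invented for illustration):

import hashlib
import json
from pathlib import Path


def check_object_fixity(object_root: Path) -> list:
    """Return content paths whose current sha512 no longer matches the
    digest recorded in the object's top-level inventory.json."""
    inventory = json.loads((object_root / "inventory.json").read_text())
    problems = []
    for digest, content_paths in inventory["manifest"].items():
        for rel_path in content_paths:
            actual = hashlib.sha512((object_root / rel_path).read_bytes()).hexdigest()
            if actual != digest:
                problems.append(rel_path)
    return problems


if __name__ == "__main__":
    for bad in check_object_fixity(Path("/data/ocfl/objects/obj-001")):
        print(f"fixity mismatch: {bad}")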

Support in-place migration of existing digital objects to OCFL objects

The spec should be written to support the migration of an existing storage root that contains objects that can be converted to OCFL format. This will be an in-place, gradual migration, during which a storage root will be declared to conform to the OCFL spec while still containing objects that are not yet OCFL-compliant alongside objects that are.

For example, Stanford has many storage roots with Moab-based digital objects; forklift migration of this content is problematic. It would be desirable to declare an existing Moab storage root as OCFL-compliant, and convert the Moabs to OCFL over time.

Version reversion and deletion for Fedora API compatibility

An institution wishes to implement the Fedora4 API using an OCFL file system for file storage. In order to maintain compatibility with the Fedora4 API, it needs to support the following features: reversion of a version, removal of a previous version.

Support physical file-level deletion

There are legitimate curatorial reasons for being able to physically remove individual files from an object. Right now, the only way to deal with this is through the Purge procedure outlined in the Implementation notes. This requires deleting the entire object and then re-creating it without the implicated files. It would be useful to work with the OCFL community to create an easier way to do this in a more automated manner that would rewrite inventories and perhaps leave a tombstone somewhere, either in the directory structure or just as metadata.

Digital repository software migrations

An institution has decided they wish to switch from one IR system to another. The old IR system does not support OCFL, so they write an exporter that translates the digital objects to an OCFL filesystem. Their new IR does understand OCFL, so once the data is exported they configure their IR to point at the OCFL roots. The new IR crawls the OCFL filesystem and indexes the object metadata into a Solr index for faster lookups.

Large Inventory.json files

As noted in OCFL/spec#642, 'inventory.json' files can become large if the OCFL object has many versions or has many files or both. The result of this can be degradation of performance. The performance impact can be acute if the managing application relies on retrieving the inventory.json over the network (e.g. OCFL in S3). Additionally, parsing the inventory.json may also become a bottleneck.

Potential solutions to the issue of large inventory.json files are described in:

Single-file OCFL object storage (e.g., Tar, Zip)

A multinational astronomical research initiative has several terabyte-sized datasets that it wishes to make available to researchers around the world. These datasets are published as 1 TB files, and so their server filesystem is optimized for storing very large files. Their OCFL Objects are stored as ZIP files to help reduce the number of small files on their storage system. They implement an OCFL server that is able to use the ZIP file header to seek within a file and extract a particular file with low overhead, effectively providing 'directory-like' lookups.
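
For illustration (the object path and member name below are invented), Python's standard zipfile module already supports this pattern: opening the archive reads the central directory, and an individual member can then be streamed without unpacking the rest of the payload:

import zipfile

# Opening the archive reads the ZIP central directory, not the terabyte-scale payload.
with zipfile.ZipFile("/data/ocfl/objects/obj-001.zip") as zf:
    # Stream a single member, e.g. the top-level inventory, without extracting the rest.
    with zf.open("inventory.json") as member:
        inventory_bytes = member.read()

print(len(inventory_bytes))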

Collapsing OCFL Object Versions

In the case where versions of an OCFL object have been created that are not considered curatorially significant, it would sometimes be useful for OCFL to support collapsing those versions. For example, in the object below, versions 3-6 might together be considered one curatorially significant version of the object...

[object root]
    ├── 0=ocfl_object_1.1
    ├── inventory.json
    ├── inventory.json.sha512
    ├── v1
    │   ├── inventory.json
    │   └── ...
    ├── v2
    │   ├── inventory.json
    │   └── ...
    ├── v3
    │   ├── inventory.json
    │   └── ...
    ├── v4
    │   ├── inventory.json
    │   └── ...
    ├── v5
    │   ├── inventory.json
    │   └── ...
    ├── v6
    │   ├── inventory.json
    │   └── ...
    ├── v7
    │   ├── inventory.json
    │   └── ...
    └── v8
        ├── inventory.json
        └── ...

It would be helpful to be able to collapse those versions, such as:

[object root]
    ├── 0=ocfl_object_1.1
    ├── inventory.json
    ├── inventory.json.sha512
    ├── v1
    │   ├── inventory.json
    │   └── ...
    ├── v2
    │   ├── inventory.json
    │   └── ...
    ├── v3 (contains collapsed result of previous versions 3-6)
    │   ├── inventory.json
    │   └── ...
    ├── v4 (previous v7)
    │   ├── inventory.json
    │   └── ...
    └── v5 (previous v8)
        ├── inventory.json
        └── ...

Object versioning

A large research library is digitising its Medieval manuscript holdings. They receive a report that a digitised book has a duplicated image for one of the pages, and realise that the photographer took two images of the same page, and none of the missing page. They re-shoot the missing page. The new page is inserted into the digital object, creating a new version of it.

Client mismatch

A particularly poorly written OCFL library was not built to observe the OCFL version stored in the root of the filesystem, and has written an object to the filesystem using an older version of the specification. An audit has revealed several objects whose declared OCFL version does not match the root OCFL version. The objects are deleted from the filesystem, and new versions are created that conform to the specification. A record of these transactions is recorded in the root log of the OCFL filesystem.
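
A sweep for this kind of mismatch could simply compare Namaste declaration files and flag anything that differs for review; a minimal sketch, assuming a root declaration such as 0=ocfl_1.1 and object declarations such as 0=ocfl_object_1.0 (the storage root path is invented):

from pathlib import Path


def find_version_mismatches(storage_root: Path) -> list:
    """Return object roots whose declared OCFL object version differs from
    the version declared at the storage root."""
    root_decl = next(storage_root.glob("0=ocfl_*")).name   # e.g. "0=ocfl_1.1"
    root_version = root_decl.rsplit("_", 1)[-1]            # "1.1"
    mismatched = []
    for obj_decl in storage_root.rglob("0=ocfl_object_*"):
        object_version = obj_decl.name.rsplit("_", 1)[-1]
        if object_version != root_version:
            mismatched.append(obj_decl.parent)
    return mismatched


if __name__ == "__main__":
    for obj in find_version_mismatches(Path("/data/ocfl")):
        print(f"declared object version differs from storage root: {obj}")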

Support notion of logical file that is stored in multiple parts in order to handle very large files

Some institutions may have very large files that are inconvenient or impossible to store as single files within an OCFL digital object. It would always be possible to split files into multiple parts in such a way that each part is treated as a first-class file by OCFL, but that pushes the modeling/support burden onto the application. However, an OCFL model/convention for multipart files would allow the development of shared tooling to handle large files.

Package per version storage

In cases where there are many small files in an object or where the storage infrastructure is not efficient at handling many files, it is useful to package files using a technology such as ZIP. This is addressed for the whole object in #10. However, packaging the whole object as a ZIP/Tar etc. breaks the idea of immutability of version data. One could instead package the inventory and content for each new version as a new ZIP file.
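
A minimal sketch of that idea, assuming each finished version directory is packaged as a sibling ZIP (the vN.zip naming is an assumption for illustration, not an OCFL convention):

import zipfile
from pathlib import Path


def package_version(object_root: Path, version: str) -> Path:
    """Package one immutable version directory (e.g. 'v3') as v3.zip,
    leaving packages for earlier versions untouched."""
    version_dir = object_root / version
    archive_path = object_root / f"{version}.zip"
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(version_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the object root, e.g. "v3/inventory.json".
                zf.write(path, path.relative_to(object_root).as_posix())
    return archive_path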

Combined access and preservation storage

An institution decides that it will use the same file storage architecture for both access and preservation copies. Repository software operates on the underlying filesystem itself for both routine access (e.g., a request for an image from an IIIF server) as well as making periodic copies to nearline storage and tape backup systems. In this case, they may implement the OCFL server access pattern, differentiating between systems that provide read/write and read-only access.

Separate access and preservation storage

An institution decides that it will employ an institutional repository system that does not implement OCFL, but that it wishes to use OCFL for long-term preservation. They implement OCFL on their long-term nearline storage (e.g., Amazon Glacier) as well as the preferred layout for storing to tape. They may employ the Export-only access pattern to dump changes periodically to an OCFL storage system, and then implement a server-agnostic OCFL client to upload and update digital objects on an S3 system.

OCFL Object Forking

In Zenodo we have a use case where we have two layers of versioning. A user can publish a dataset on Zenodo, which will get a DOI. A new version of the dataset can be published by the user, which will get a new DOI. This way a DOI always points to a locked set of digital files. Occasionally, however, we need to change files of an already published dataset with a DOI (e.g. a user accidentally included personal data in the dataset and discovered it 2 months later). Essentially this means we have two layers of versioning in Zenodo, which I'll call

  • Versioning (each version gets a new DOI - at the repository level each version is a separate record)
  • Revisions (edits to a single version - at the repository level this is a single record).

In the Zenodo case, our need for deduplication is essentially between versions, because that's where a user may add only 1GB to a 100TB dataset.

The way we have thought about mapping Zenodo to OCFL is that each DOI is essentially an OCFL object. Because OCFL only supports deduplication within an OCFL object, not between OCFL objects, and does not allow symlinks, we cannot do this deduplication.

Example

Imagine these actions:

  1. Publish first version 10.5281/zenodo.1234 with two very large (let's just say 100TB to exaggerate) files: data-01.zip and mishap.zip
  2. Publish new version 10.5281/zenodo.4321 with one new file: data-02.zip (its files are thus data-01.zip and data-02.zip).
  3. Remove mishap.zip from 10.5281/zenodo.1234

The OCFL objects would be:

[10.5281/zenodo.1234]
    ├── 0=ocfl_object_1.0
    ├── inventory.json
    ├── inventory.json.sha512
    ├── v1
    │   ├── inventory.json
    │   ├── inventory.json.sha512
    │   └── content
    │       ├── data-01.zip
    │       └── mishap.zip
    └── v2
        ├── inventory.json
        ├── inventory.json.sha512
        └── content


[10.5281/zenodo.4321]
    ├── 0=ocfl_object_1.0
    ├── inventory.json
    ├── inventory.json.sha512
    └── v1
        ├── inventory.json
        ├── inventory.json.sha512
        └── content
            ├── data-01.zip (duplicated 100TB of data!!!)
            └── data-02.zip

What I would like is to not have to duplicate data-01.zip in the 10.5281/zenodo.4321 OCFL object.

Is there a solution for this in OCFL, or a different way to construct our OCFL objects that could support this?

Support multiple files with same content in one version of an object

A repository that archives arbitrary content may include versions of objects that include the same content in two or more files. A special case of this is zero-size files. While files with the same content may not occur as a result of controlled digitization workflows, it is easy to imagine cases where user-supplied repository content (e.g. from an institutional repository, from arXiv.org, or from a disk image) would include multiple files with identical content (and hence identical digests).

Allow past version file retrievals

A researcher has cited a URL to an older version of a digital object in a research article. The institution has implemented the Memento standard, allowing the researcher's citation to refer to the previous version of the digital object, instead of the newest version. The OCFL client presents the files representing the previous version of the object on request.
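
Resolving such a request needs only the inventory: a version's state block maps digests to logical paths, and the manifest maps digests to stored content paths. A minimal sketch (the object path, version, and logical path are invented for illustration):

import json
from pathlib import Path


def resolve_past_file(object_root: Path, version: str, logical_path: str) -> Path:
    """Return the stored content path for `logical_path` as it existed in
    the requested version of the object."""
    inventory = json.loads((object_root / "inventory.json").read_text())
    state = inventory["versions"][version]["state"]
    for digest, logical_paths in state.items():
        if logical_path in logical_paths:
            # The manifest records where that content physically lives.
            return object_root / inventory["manifest"][digest][0]
    raise FileNotFoundError(f"{logical_path} not present in {version}")


if __name__ == "__main__":
    print(resolve_past_file(Path("/data/ocfl/objects/obj-001"), "v2", "images/page-042.tif"))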

Partial updates or additions

A large collection of digitized material is being generated, with the desire to archive it as it goes. The archivist wishes to add new files and update existing files without having the entire collection on hand at once. The OCFL client should be able to create new versions of an object based solely on the updates and additions.

Support validation of all objects under a storage root according to a particular object disposition convention

Client code should be able to validate that objects under a storage root conform to a particular object disposition convention, and potentially that no other files exist under the storage root. The convention may be a common/shared one (such as pairtree, where the OCFL object id can be reconstructed from the directory path) or a local convention. A goal of declaring the structure is to help make the storage root self-documenting.
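
A minimal sketch of such a check, using a simplified pairtree-like convention (real pairtree also percent-encodes special characters; the convention and storage root path here are assumptions for illustration only):

import json
from pathlib import Path


def expected_path(object_id: str) -> Path:
    """Simplified pairtree-like disposition: the id split into two-character segments."""
    return Path(*[object_id[i:i + 2] for i in range(0, len(object_id), 2)])


def check_disposition(storage_root: Path) -> list:
    """Return object roots whose location does not match the path implied by
    the object id recorded in their inventory.json."""
    misplaced = []
    for decl in storage_root.rglob("0=ocfl_object_*"):
        object_root = decl.parent
        object_id = json.loads((object_root / "inventory.json").read_text())["id"]
        if object_root != storage_root / expected_path(object_id):
            misplaced.append(object_root)
    return misplaced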

Object transfer

An organization may wish to transfer a digital object to another organization's storage system. This might be the case in distributed newspaper digitisation initiatives, for example, where individual institutions digitise their collections, but then send them to a central library for access and preservation purposes. In this case, existing standards such as BagIt can be used to package and transfer OCFL objects. An OCFL object, including any metadata about object versions, may be transferred within the BagIt data directory.

EPrints-Archivematica Export Structure Compatibility

I apologize in advance if this is not the best place to raise this question - if that is the case, please direct me to the more appropriate place (I did notice that there are a number of different email lists and a slack channel for this group).

I was introduced to OCFL at OR2018, and I immediately saw the potential to have this inform something that I am working on, as well as be a bridge across repository systems. At the same OR2018, I co-presented a proposal for an EPrints-to-Archivematica export format, for preservation purposes. This format uses a folder structure, and ideally this folder structure would be compatible with the OCFL.

Here are the details of the proposal: https://spectrum.library.concordia.ca/983933/

Right away, I see two places where there is a divergence between that proposal and OCFL, and I want to explore/discuss them:

  1. The last modified date is placed right into the folder name of the top level object in our proposal. This also means that the entire object is replicated whenever any modification is made. This is not efficient in terms of storage space, but it has its own advantages of clarity and ease of retrieval later on. The OCFL uses a sequential "version 1...x" folder with changed files only.

  2. In our proposal, BagIt is used for creating manifests - whereas OCFL uses the inventory.jsonld format for this.

I suppose that I am looking to understand the reasoning behind OCFL's choices, and if these are compelling, possibly modify my proposal/plan.

Corruption Recovery

A power outage occurred when a software component was in the middle of writing an OCFL object, leaving the object in an ambiguous state. There should be mechanisms for recovering from various failure modes.

Recording correctly ingested bad data

A file included in a version of an OCFL object is in some way corrupted (perhaps a bad PDF or a badly encoded JPEG 2000), although its digest is correct in the OCFL record, so the OCFL object itself is entirely valid.

Rebuildability

An institution has had a flood event in its data centre, and this has wiped out the disk arrays that power its institutional repository website. However, it has maintained a Git repository of its software, and has been able to recover its most recent backups from an off-site tape storage. They restore the tapes and upload the OCFL backups to an S3 account. They adjust the settings in their software to point to the new storage system. Their institutional repository repopulates its internal database with metadata from the OCFL objects, and they are back online within a few weeks, with all object content and provenance intact.

Support file renaming between versions

It should be possible to have files with the same content but different names in different versions of an OCFL object. In essence this is part of a requirement that the content of one version of an OCFL object should not constrain the content of any other version of an OCFL object (though overlapping content may lead to storage efficiencies).
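
The released OCFL 1.0 inventory model does accommodate this; a sketch of the relevant fragment, expressed as a Python dict purely for illustration (the digest and paths are invented): the same manifest entry is referenced under different logical names in different version states, so only one copy of the bitstream is stored.

# Fragment of an inventory, as a Python dict for illustration only.
# One stored bitstream ("abc123...") appears as report-draft.pdf in v1 and
# as report-final.pdf in v2; only the copy under v1/content/ exists on disk.
inventory_fragment = {
    "manifest": {
        "abc123...": ["v1/content/report-draft.pdf"],
    },
    "versions": {
        "v1": {"state": {"abc123...": ["report-draft.pdf"]}},
        "v2": {"state": {"abc123...": ["report-final.pdf"]}},
    },
}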

Support file locking and multiple services working on same object

In an archival repository based on OCFL there are multiple processes handling ingest, update (new versions), format validation (with ever improving tools that cover more formats as time goes on), and fixity checking. All of these processes may write to some part of an OCFL object (either the data itself or the logs).

Is some OCFL-object locking mechanism required to avoid collisions or race conditions? If so, is that in or out of scope for the spec?

Defining a repository from peer storage roots

[Moved from the spec issues repository as this describes a new use case of handling multiple storage roots making up one repository. It includes both the aggregation of content in multiple storage roots and possibly replication of content.]

This may be a part of issue OCFL/spec#22 and it certainly follows on from the comment.

My institution can't provide a single 200TB volume (!). But they can give me 2 x 70TB and a 60TB volume. So for my use case I now need to have 3 OCFL filesystems that I interact with as a single unit from my service.

Given this, it would be nice to be able to define metadata at the repository level that says this filesystem is a part of a larger set of peers. Nice to haves would include defining a priority for each peer and perhaps the storage tier. That way, clients can make smart decisions about ranking peers by tier and then priority (I imagine these are properties defined by the administrators provisioning the storage).

The justification for this is that any connecting service or user inspecting the filesystem can identify that it is part of a larger set.

For example - a storage.json or some such with content like:

{
  "peers": [
    {
      "type": "filesystem",
      "mountpoint": "/mnt/ocfl-repo1",
      "priority": 1,
      "tier": "hot"
    },
    {
      "type": "filesystem",
      "mountpoint": "/mnt/ocfl-repo2",
      "priority": 2,
      "tier": "cold"
    },
    {
      "type": "s3",
      "endpointUrl": null,
      "forcePathStyle": false,
      "priority": 2,
      "tier": "warm"
    },
    {
      "type": "filesystem",
      "mountpoint": "/mnt/ocfl-repo3",
      "priority": 1,
      "tier": "hot"
    }
  ]
}

For the s3 entry, a null endpointUrl would mean AWS S3, while a URL would point at something like a local minio instance; forcePathStyle (required for minio) would default to false when omitted.

In this model, priority can be any sequential number and tier could be 'hot', 'warm', or 'cold' to dovetail with typical industry nomenclature.

Individual version directories can be stored external to the OCFL object

A library has limited on-site storage and is using an external provider with an object store for hosting the majority of its bitstreams. The OCFL Object inventory should be able to accommodate multiple remote storage locations for a version directory to allow for multi-site, off-site storage with the on-site inventory being used for local lookups.

There can be multiple storage roots, and these can be a mixture of both local and remote.

Related to #10

See: OCFL/spec#22

Adding data to inventory.json

This is for collecting use-cases where a user might want to add some data to inventory.json. In OCFL 1.0, inventory.json is completely locked down, with no possibility of adding an extra key or extra data (even with an extension).

Note: some or all of the use cases here have other options besides using inventory.json. It is possible to use inventory.json as defined in 1.0, and use other options for these use-cases. However, that does not mean that there isn't interest from users in adding data to inventory.json, and some might choose to do so if that's an option.

Mime Types

I am writing an OCFL HTTP layer, and when serving out a file, I want to serve it with the correct Content-Type header. This HTTP layer could be used by different repositories that store technical metadata in different files and formats. One way to have a standard location for mime types would be to define a "mimetypes" key under an "extensions" key in inventory.json, and store the mime type for each file (as identified by its checksum).

Related Note: we are coming from Fedora 3, where the mime type is stored in the FOXML, not in a datastream. Of course we can put the mime type in a content file if we choose to, but using inventory.json would have some similarities to the FOXML from Fedora 3.
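
A hypothetical sketch of how the HTTP layer might read such a block (the "mimetypes" key and its layout are assumptions for discussion, not anything OCFL 1.0 defines):

import json
from pathlib import Path


def content_type_for(object_root: Path, digest: str) -> str:
    """Look up a file's mime type from a hypothetical 'mimetypes' block under
    'extensions' in inventory.json, falling back to a generic type."""
    inventory = json.loads((object_root / "inventory.json").read_text())
    mimetypes = inventory.get("extensions", {}).get("mimetypes", {})
    return mimetypes.get(digest, "application/octet-stream")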

IU tape preservation

See OCFL/spec#474 (comment), OCFL/spec#474 (comment), OCFL/spec#474 (comment) for @bdwheele comments about adding data to inventory.json.

Suggestion

Open up the spec to allow extensions to define data that can be placed in the "extensions" object in inventory.json. Any validation tools must ignore the "extensions" data if they don't understand it, but validate the data if they do. So a basic OCFL validation tool that doesn't support any extensions would make sure inventory.json parses, validate all the spec keys, ignore the "extensions" data, and report the object as valid.

Any extensions that are submitted can be analyzed via the extension process to see if they would be a good fit for inventory.json.

Deleting a file from all versions of an OCFL object

First off, congratulations to the OCFL editors on the release of v1.0!

Let's imagine a scenario in which a file had to be deleted from all versions of an OCFL object. Would a subsequent version of the OCFL specification support flagging a digest in the manifest as 'deleted' or 'expunged' so that the OCFL object can still be validated? Is recording the file deletion in the version history of an OCFL object in scope?

I'm inclined to say that the goal of OCFL, as an archival and not an access format, is to maintain the entire history of an object and all associated files. As such, 'scrubbing' a file from an OCFL object should not be in scope. However, as OCFL is used in more projects, situations mandating the deletion of files will inevitably happen. Tooling around OCFL objects might want to support an operation to delete files from all versions and rebuild the inventory.json files for all versions containing that file.

Support Merging Objects

As a repository manager, I have systems that operate on my objects remotely (e.g., a cloud service workflow engine can be used to transcode my TIFF images to JPEG2000). These changes would be stored as a new version.

However, I would also want to update other copies of this object with the content from the new version.

Thus, we should support merges of OCFL objects, and identify the cases where this might fail (e.g., two systems producing the same version number but different changes).

Application Profiles

The OCFL is designed to specify the layout, inventory, and versioning of hierarchies of files targeted for long-term preservation.

The OCFL does not take a position on the types of files that are being preserved, how they are interrelated, or any other details that would be required for an application that reads/writes objects in the OCFL storage root to apply the semantics commonly required to manage content and data modeling.

The use case described here is a recognition of this gap between what the OCFL specification provides and what applications implemented over an OCFL Storage Root require.

One of the opportunities afforded by the OCFL is the potential to decouple the preservation storage from the upper-level application that manages the content within the storage. An outcome of this is that it becomes (hypothetically) possible to replace the repository management application with a different application while leaving the OCFL preservation storage in-place. However, the new management application needs to understand the conventions, metadata standards, structure, and semantics of the files within the OCFL object's content directory.

This use case highlights the need for a mechanism to (minimally) provide a hint within an OCFL Storage Root and/or OCFL Objects and/or OCFL Versions to indicate to which application profile the Objects/Versions conform. The "application profile" could be a specification document or (more aspirationally) a machine-actionable configuration.

Purging/permanent deletion of individual files

An institution accepts a dataset from a researcher containing a number of binary files. On subsequent review, one of these files contains personally identifiable medical information. The file needs to be removed permanently from the file system.

Preserve file mimetype

For digital preservation purposes, and also for compatibility with the Fedora 4 API, an institution may wish to preserve the mimetype of a binary file as metadata within the OCFL file system.

Support logging of preservation actions as part of the object

In my preservation repository I will conduct periodic fixity checks, periodic re-characterization and assessment of at-risk content, and possible migration or derivative generation actions that create new versions. I want to be able to record the results of the checks and actions in a way that lives on as part of the object.
