mjordan / islandora_workbench

A command-line tool for managing content in an Islandora 2 repository

License: MIT License

Python 99.87% Dockerfile 0.07% CSS 0.06%

islandora_workbench's Introduction

Islandora Workbench

A command-line tool that allows creation, updating, and deletion of Islandora content from CSV data. Islandora Workbench is an alternative to using Drupal's built-in Migrate tools for ingesting Islandora content from CSV files. Unlike the Migrate tools, Islandora Workbench can be run anywhere - it does not need to run on the Islandora server. The Migrate tools, however, are much more flexible than Islandora Workbench, and can be extended using plugins in ways that Workbench cannot.

Note that this tool is not related in any way to the Drupal contrib module called Workbench.

Features

Allows creation of Islandora nodes and media, updating of nodes, and deletion of nodes and media from CSV files
Allows creation of paged/compound content
Can run from anywhere - it communicates with Drupal via HTTP interfaces
Provides robust data validation functionality
Supports a variety of Drupal entity field types (text, integer, term reference, typed relation, geolocation)
Can provide a CSV file template based on Drupal content type
Can use a Google Sheet or an Excel file instead of a local CSV file as input
Allows assignment of Drupal vocabulary terms using term IDs, term names, or term URIs
Allows creation of new taxonomy terms from CSV field data, including complex and hierarchical terms
Allows the assignment of URL aliases
Allows adding alt text to images
Supports transmission fixity auditing for media files
Cross platform (written in Python, tested on Linux, Mac, and Windows)
Well tested
Well documented
Provides both sensible default configuration values and rich configuration options for power users
A companion project under development, Islandora Workbench Desktop, will add a graphical user interface that enables users not familiar or comfortable with the command line to use Workbench.
Can be run from within a Docker container.

Documentation

Complete documentation is available.

Contributing to Workbench

Metadata, files, and Drupal configurations are, in the real world, extremely complex and varied. Testing Islandora Workbench in the wild is the best way to help make it better for everyone. If you encounter a difficulty or an unexpected behavior, or if Workbench crashes on you, reach out on the #islandoraworkbench Slack channel or open an issue in this GitHub repo.

If you have a suggestion for improving the documentation, please open an issue on this repository's queue and tag your issue "documentation".

If you want to contribute code (bug fixes, optimizations, new features, etc.), consult the developer's guide.

Using Workbench and reporting problems is the best way you can help make it better!

Current maintainer

Mark Jordan

islandora_workbench's Issues

KeyError during task 'update'

Getting a KeyError exception when running the update task. Working on it.

Provide option to validate transmission fixity

Just an idea, but related to Islandora/documentation#867:

Workbench could verify transmission fixity by generating a digest for a file before PUTing it and then, once the file has been ingested, retrieving its checksum using the technique from mjordan/islandora_riprap#33, as implemented in the Islandora Workbench Integration module.
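
As a rough illustration of the local half of this idea, the sketch below computes a file digest in chunks and compares it with a checksum string obtained from the server. The function names and the choice of SHA-256 are assumptions made for the example; how the remote checksum is actually retrieved is outside this sketch.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_matches(local_path, remote_checksum, algorithm="sha256"):
    """Compare a locally computed digest with a checksum reported by the server."""
    return file_digest(local_path, algorithm) == remote_checksum.strip().lower()
```

The digest would be generated before the upload, and `fixity_matches()` called once the server-side checksum is available.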

Validate typed relation values

We need to validate that the relations are present in the list configured in the field config, and we also need to validate that the target ID exists in the linked taxonomy. Otherwise, Drupal returns a 422 HTTP status code.

Validate that files exist

Provide a way to add a media to nodes

Move media_use_tid from global configuration options to CSV values

Multi-value Columns

Some fields allow multiple values (e.g. field_subject, cardinality set on field storage config).

One way to allow multiple values is to configure the column with a sub-delimiter. For example, the file is delimited with commas, and then a particular field can be delimited with semicolons. The example CSV below has a field 'field_subject' that takes a semicolon delimited list of term ids.

field_identifier,title,field_subject,file
example001,My Image,5;7;33,example001.tif

This example extended to support field lookups (#18) could take a semicolon delimited list of subject URIs:

field_identifier,title,field_subject,file
example001,My Image,http://id.loc.gov/authorities/subjects/sh85049588;http://id.loc.gov/authorities/genreForms/gf2017027249;http://sws.geonames.org/4013157/,example001.tif
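
A minimal sketch of the sub-delimiter idea, using the first example CSV above. The `SUBDELIMITER` constant and `split_subvalues()` helper are hypothetical names chosen for illustration, not existing Workbench code.

```python
import csv
import io

SUBDELIMITER = ";"  # assumed to be configurable in practice


def split_subvalues(cell, subdelimiter=SUBDELIMITER):
    """Split one CSV cell into a list of non-empty, trimmed subvalues."""
    return [value.strip() for value in cell.split(subdelimiter) if value.strip()]


csv_text = """field_identifier,title,field_subject,file
example001,My Image,5;7;33,example001.tif
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
subjects = split_subvalues(rows[0]["field_subject"])
```

Because the file itself is comma-delimited, the semicolons pass through the CSV parser untouched and can be split afterwards, one field at a time.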

Figure out what happens when the number of values added to a field exceeds the field's cardinality

Most fields defined by Islandora Defaults have a cardinality of either 1 or -1 (unlimited). If a user attempts to add a number of values to a field (see #19) that exceeds the field's cardinality, what happens? Does Drupal throw an exception? Are the excess values ignored?

In any case, we probably want to log the fact that the number of values being added exceeds the maximum allowed.

Figure out how to determine whether a field is required

Strangely, http://admin:islandora@localhost:8000/jsonapi/field_storage_config/field_storage_config does not appear to indicate whether a field is required or not. This information would be invaluable since if we have it, we can warn the user that they have omitted a required field. If a field is required, and it is not present in the CSV file, Drupal does not create the node.

Non-multivalued non-entity reference fields not being added on 'create' task

Add a task to validate the input data

Now that we can introspect fields, we should offer a way for the user to validate their input data, e.g., all the columns in the CSV file exist, the field types are consistent with the data in the columns, and the files that are referenced exist.
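
One possible shape for such a validation step is sketched below. It only covers two of the checks mentioned (unknown columns and missing files), and every name in it, including the assumption that `file` is a reserved column rather than a Drupal field, is illustrative.

```python
import os


def validate_input(headers, rows, known_fields, input_dir, reserved=("file",)):
    """Return a list of human-readable problems found in the input data."""
    problems = []
    for header in headers:
        if header not in known_fields and header not in reserved:
            problems.append(f"Column '{header}' is not a known field.")
    # Row numbers count data rows only; the header row is not included.
    for number, row in enumerate(rows, start=1):
        filename = row.get("file", "")
        if filename and not os.path.isfile(os.path.join(input_dir, filename)):
            problems.append(f"Row {number}: file '{filename}' not found.")
    return problems
```

Collecting all problems before reporting lets the user fix the whole input file in one pass instead of one error at a time.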

Include a setup script

Be sure to test on Windows.

The check that each row contains the same number of columns as there are headers is not working

--check doesn't detect that some rows in the CSV don't have the same number of columns as there are CSV headers.
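
For reference, a sketch of one way to detect such rows. It uses `csv.reader` rather than `csv.DictReader`, since `DictReader` pads short rows with `None` and collects extra values under a single key, which can hide the mismatch. Whether that is the cause of this bug is an assumption; the function name is hypothetical.

```python
import csv
import io


def find_ragged_rows(csv_text, delimiter=","):
    """Return (record_number, column_count) for rows whose width differs from the header's."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    headers = next(reader)
    ragged = []
    for record_number, row in enumerate(reader, start=2):  # the header is record 1
        # csv.reader yields [] for blank lines; those are skipped here.
        if row and len(row) != len(headers):
            ragged.append((record_number, len(row)))
    return ragged
```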

Add support for the tus resumable upload protocol

@kayakr points out over at Islandora/documentation#1172 that there is a Drupal contrib module for tus. There is also a tus client for Python.

Incorporate requests-threads

https://github.com/requests/requests-threads

As @dannylamb said in IRC, "managing the pool with 1 thread would be good to do, though. then you're not doing all the bookkeeping for managing the batch of requests"

Figure out a way to get a list of fields for a content type via REST

If we can inspect the fields that a content type has configured, we can validate that the columns in the input file match machine field names, and we can also dynamically create requests for string, taxo term, and linked references.

Refactor for better code organization, testability

workbench is now over 500 lines of code and growing. It should be refactored so it is easier to read and develop in, and so that automated tests can be applied.

Add function to strip all CSV values

All values should be stripped of leading and trailing white space, newlines, etc.
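
A minimal sketch of such a function, assuming rows are dictionaries as produced by `csv.DictReader`. The name `clean_csv_values` is made up for this example.

```python
def clean_csv_values(row):
    """Return a copy of `row` with surrounding whitespace stripped from string values."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in row.items()
    }
```

`str.strip()` with no arguments removes spaces, tabs, and newlines from both ends; non-string values such as `None` are passed through unchanged.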

Add options for the CSV file

Delimiter, quote character, etc. Provide sensible defaults.
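
A sketch of how defaults and user-supplied options could be combined. The option names `delimiter` and `quotechar` mirror the keyword arguments of Python's `csv` module; using the same names in the Workbench configuration is an assumption.

```python
import csv
import io

# Hypothetical defaults; these match the csv module's own "excel" dialect.
CSV_DEFAULTS = {"delimiter": ",", "quotechar": '"'}


def read_rows(csv_text, config=None):
    """Parse CSV text into dictionaries, letting `config` override the defaults."""
    options = dict(CSV_DEFAULTS)
    # Ignore unrelated configuration keys so the whole config can be passed in.
    options.update({k: v for k, v in (config or {}).items() if k in CSV_DEFAULTS})
    return list(csv.DictReader(io.StringIO(csv_text), **options))
```

For example, a tab-delimited file would be read with `read_rows(text, {"delimiter": "\t"})`.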

Provide ability to delete field values

Currently there is no way to remove all values from a field. This should probably be a specific case within the "update" task. Also, in the update task, allow users to replace existing values, not append to them (as happens now). Maybe the latter could be a two-step process where the user deletes all values and then updates with new ones?

Add ability to delete nodes

From an input file containing node IDs or URLs.

Remove crufty slicing of subvalues

Looks like I introduced unnecessary slicing of subvalues in logic that deals with cardinality of field values. Specifically, if the cardinality of a field is -1 (unlimited), I'm slicing on this value, which results in removing the last item in the list.

I will take care of this once issue-20 branch is merged.
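
The effect described above can be reproduced in a few lines. The helper name is hypothetical and the fix shown is only one way to guard against it.

```python
def limit_to_cardinality(values, cardinality):
    """Trim `values` to the field's cardinality; -1 means unlimited."""
    if cardinality == -1:
        return list(values)
    return list(values[:cardinality])


subvalues = ["5", "7", "33"]
buggy = subvalues[:-1]                       # slicing on -1 drops the last item
fixed = limit_to_cardinality(subvalues, -1)  # unlimited: keep everything
```

In Python, `values[:-1]` means "everything except the last element", so using the cardinality directly as a slice bound only works for positive values.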

Add more logging

Allow different ways to "update" field values

Updates are complicated. Currently, updates in workbench "preserve any values in the fields, they don't replace the values." It would be useful to allow for this type of update, as well as updates that replace existing field values with ones in the input data, delete field values (#39), and remove a specific value from a multivalued field (#40).

It would also be useful to allow users to specify at the row level, rather than for all rows in the input CSV, which of these types of "update" is intended. One option (not necessarily the best) is to have the user provide a flag as the first character in the field itself that indicates which type of update they want. Typed relation fields already use a structured value using : to separate the parts of the value, so an approach like that might be applicable. For example, the user specifies an a for "append input values", an r for "replace all values with input", and a d for "delete all existing values or a specific value", like this:

field_my_field
a:additional value
d:
d:bad_value
r:new value

This would have to work with multivalued fields, and also with typed relation fields, which already use the :.

If the user didn't want row-level granularity for a given job (i.e., all update operations were of the same kind), the flag should be configurable in the config file.
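
To make the proposal concrete, here is a sketch of how such flags might be parsed. It treats a value as flagged only when its first character is a known flag and its second is a colon, so a typed relation value such as `relators:pht:5` (format assumed) falls through to the default. All names are hypothetical.

```python
UPDATE_MODES = {"a": "append", "r": "replace", "d": "delete"}


def parse_update_value(cell, default_flag="a"):
    """Split a cell like 'r:new value' into (mode, value)."""
    if len(cell) >= 2 and cell[1] == ":" and cell[0] in UPDATE_MODES:
        return UPDATE_MODES[cell[0]], cell[2:]
    # No flag present: fall back to the job-wide default from the config file.
    return UPDATE_MODES[default_flag], cell
```

One limitation of this approach is that an unflagged value that happens to start with `a:`, `r:`, or `d:` would be misread as flagged, which is part of why the flag scheme is "not necessarily the best" option.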

Image Resizing

It is useful to resize images (if that is what we are uploading) locally since 1) desktop processor time is essentially free and 2) we can make thumbnails for the GUI editor before upload.

We could use ImageMagick, but users would need to install it on their machine before the workbench is installed. (Searching for ways to include it natively has not worked.)

We could use the Python Pillow library (the original PIL doesn't have a Python 3 release). Presumably we could wrap calls there like we do for the ingester, although I don't know the implications for building the distribution binaries.

Another option is the Node.js Sharp module which, in some testing, performs faster than a JavaScript wrapper around ImageMagick. The benefit of this is we know it can integrate with an Electron app. The downside is that it doesn't appear to support JPEG 2000.

Add ability to use custom fields

Currently only title and description are allowed.

Allow for empty file values

We should allow for empty file field values, so that nodes can be created without attached media.

Add ability to update nodes

Provide a delete_media task

To complement 'add_media'. Should be done in conjunction with #38 so we can reuse the delete code.

Add CONTRIBUTING.md

In update task, slicing of values in fields of limited cardinality doesn't respect existing values

This comes from #37.

In the update task, for fields of non-1 and non-unlimited cardinality, we slice the subvalues to match the cardinality. What we should be doing is making sure the existing values remain, and then slice the rest to get the new values to add.
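
A sketch of the intended behavior: existing values are always kept, and new values only fill whatever slots remain. The function name and signature are assumptions for illustration.

```python
def merge_with_existing(existing, new, cardinality):
    """Keep all existing values, then add new ones until the field is full."""
    if cardinality == -1:  # unlimited
        return list(existing) + list(new)
    free_slots = max(cardinality - len(existing), 0)
    return list(existing) + list(new[:free_slots])
```

With a cardinality of 3, existing values `["a", "b"]`, and new values `["c", "d"]`, only `"c"` is added; `"d"` is dropped and would be a good candidate for a log message.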

Wrap all requests in a function

This would allow centralized error handling, etc.
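
A minimal sketch of what a single wrapper could look like. It uses only the standard library's `urllib` so that it is self-contained; that is a substitution made for this example, not a statement about which HTTP library Workbench uses. The name `send_request`, the `host` and `timeout` config keys, and the injectable `opener` parameter are all hypothetical, and authentication is omitted.

```python
import json
import logging
import urllib.error
import urllib.request


def send_request(config, method, path, headers=None, payload=None,
                 opener=urllib.request.urlopen):
    """Send one request; return (status, headers, body). Status is None if no response arrived."""
    url = config["host"].rstrip("/") + path
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(url, data=body, headers=headers or {}, method=method)
    try:
        with opener(request, timeout=config.get("timeout", 30)) as response:
            return response.status, dict(response.headers), response.read()
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses still carry a status code and a body.
        logging.error("%s %s returned HTTP %s.", method, url, error.code)
        return error.code, dict(error.headers), error.read()
    except urllib.error.URLError as error:
        # Connection refused, DNS failure, timeout while connecting, etc.
        logging.error("%s %s failed: %s", method, url, error.reason)
        return None, {}, b""
```

Routing every request through one function means logging and error handling live in one place, and the `opener` parameter lets tests substitute a fake without touching the network.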

Need a functional test framework

I haven't evaluated the popular test frameworks for Python, but given the growing complexity of workbench's code and the various combinations of field types, cardinality, etc., small changes could easily lead to regression errors. Since the main functionality of this application is interacting with Drupal via REST, any framework we use would need to be able to support mock response objects, etc. (or whatever Python calls those).

So far while hacking on workbench, I have always had a live CLAW Vagrant running. I know it's not good practice to rely on external systems in automated testing, but I'd be perfectly happy to require a vagrant as part of a test framework. In fact, our functional tests in Islandora 7.x worked against a live Islandora. That would complicate CI however.

The testing pattern I'm imagining is:

within a test, workbench makes request to a live Islandora at localhost using fixture data. For POSTing new nodes, this means a new node is created on the vagrant.
response is evaluated, or a subsequent GET is issued to get data to evaluate test
if the test is successful, the node is then DELETED or updated (and then deleted) if necessary to keep the local Islandora instance clean

Again, I know purists will say "you should always mock up your responses" but I'm more of a pragmatist than a purist.

Add ability to replace file

Allow fields to be applied to all new objects by using a single configuration value

If a value in a CSV field is the same for all new nodes (task = create), it might be useful to allow this field:value pair to be defined once in the configuration file. Some examples are #22, and #12. The logic would be if a field name: value pair is present in the config file, apply that value to all new nodes (task = create). We would probably need to create a nested map in the config file, e.g.:

node_fields:
  field_member_of: 12
  content_type: islandora_object

In this example, all of the newly created nodes would be islandora_object nodes with a field_member_of value of 12.
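
A sketch of the merge logic, with the configuration shown as an already-parsed dictionary. Letting a non-empty CSV value override the configured default is an assumption; the issue itself does not say which should win.

```python
# Equivalent of the nested node_fields map above, after the config file is parsed.
config = {"node_fields": {"field_member_of": 12, "content_type": "islandora_object"}}


def apply_node_field_defaults(row, config):
    """Start from the configured defaults, then let non-empty CSV values override them."""
    merged = dict(config.get("node_fields", {}))
    merged.update({key: value for key, value in row.items() if value not in (None, "")})
    return merged
```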

Relocate check for 'location' header

Currently, the check for the location HTTP response header is outside of the block that checks for a successful create (201) response:

node_response = issue_request(config, 'POST', node_endpoint, node_headers, node, None)
node_uri = node_response.headers['location']
if node_response.status_code == 201:
    print("Node for '" + row['title'] +
          "' created at " + node_uri + ".")
    logging.info("Node for %s created at %s.", row['title'], node_uri)

The check for 'location' should go within this block.
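
One way the relocated check might look, written as a standalone function so it can be exercised without a live Drupal. The function name, the error-logging branch, and the return value are assumptions added for the sketch; `node_response` is any object with `status_code` and `headers` attributes.

```python
import logging


def report_created_node(node_response, title):
    """Return the new node's URI on a 201 response, otherwise None."""
    if node_response.status_code == 201:
        # Only read the header once we know the create succeeded.
        node_uri = node_response.headers['location']
        print("Node for '" + title + "' created at " + node_uri + ".")
        logging.info("Node for %s created at %s.", title, node_uri)
        return node_uri
    logging.error("Node for %s not created, HTTP response code was %s.",
                  title, node_response.status_code)
    return None
```

With the header lookup inside the success branch, a failed create (for example a 422) no longer raises a `KeyError` on the missing header.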

Show media URL in output

Since we can get the URL of a newly created media in the HTTP response headers (e.g. Location: http://localhost:8000/media/4), we should show this to the user and also log it. So instead of:

Node for 'Small boats in Havana Harbour' created at http://localhost:8000/node/52.
+File media for IMG_1410.tif created.

we could show:

Node for 'Small boats in Havana Harbour' created at http://localhost:8000/node/52.
+File media for IMG_1410.tif created at http://localhost:8000/media/4.

Node reference lookup field

We may have spreadsheet rows that need to reference other rows, so we need to be able to provide a look-up value.

For example, I have a parent record and two child records, all of which use the field_identifier column. I also have a field_member_of column (could be field_part_of, etc.). The child records don't have a node id for the parent record (because it hasn't been created yet) but it does have an identifier value.

field_identifier,field_member_of,title,file
example001,,Parent record,
example001-001,example001,Child one,example001-001.tif
example001-002,example001,Child two,example001-002.tif

We should be able to tell the workbench that a certain column (e.g. field_member_of) should look up the node id based on a configured node field (e.g. field_identifier).
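
A sketch of the lookup, assuming parent rows appear before their children in the CSV. `create_node` stands in for whatever actually creates the node and returns its ID; it and the other names here are hypothetical.

```python
def create_with_references(rows, create_node,
                           key_column="field_identifier",
                           reference_column="field_member_of"):
    """Create nodes in order, swapping identifier references for real node IDs."""
    id_map = {}  # identifier value -> node ID of the node created for it
    for row in rows:
        parent_key = row.get(reference_column, "")
        if parent_key:
            if parent_key not in id_map:
                raise ValueError(f"No node has been created yet for '{parent_key}'.")
            # Replace the identifier with the parent's node ID before creating.
            row = {**row, reference_column: id_map[parent_key]}
        id_map[row[key_column]] = create_node(row)
    return id_map
```

Run against the example above, the parent row is created first, and both child rows are then sent with `field_member_of` set to the parent's new node ID.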

Add ability to add and update non-text fields

Now with #6 coming along, we can inspect each field to see if it is a string, entity reference, etc. We should also probably know if the field is multivalued or not.

In the update, delete, and add_media tasks, check to see if node with ID exists

Provide ability to remove a specific value from a multivalued field

Updating is complex. 😄

Add check to see if all column headings are Drupal field machine names

Support Linked Agents (TypedRelation Fields)

Islandora Defaults uses the TypedRelations FieldType for field_linked_agent. This means that a valid value for a field_linked_agent field includes both the relator (e.g. 'relators:pht' for a Photographer) and the Photographer's term id.

Also, assuming we implement multi-valued columns (#19), we could have a cell listing the relator code/term id pairs delimited by semi-colons.
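
As an illustration, the sketch below parses a cell of semicolon-delimited relator/term ID pairs. The cell format `relators:pht:5` (relator, colon, term ID) and the output keys `rel_type` and `target_id` are assumptions made for the example, not a confirmed format.

```python
def parse_typed_relations(cell, subdelimiter=";"):
    """Parse 'relators:pht:5;relators:art:7' into a list of relator/term ID pairs."""
    relations = []
    for subvalue in cell.split(subdelimiter):
        subvalue = subvalue.strip()
        if not subvalue:
            continue
        # Split on the last colon: everything before it is the relator.
        rel_type, _, target_id = subvalue.rpartition(":")
        if not rel_type or not target_id.isdigit():
            raise ValueError(f"Malformed typed relation value: {subvalue!r}")
        relations.append({"rel_type": rel_type, "target_id": int(target_id)})
    return relations
```

Splitting on the last colon keeps the relator's own namespace separator (`relators:pht`) intact.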

Provide configuration option for content type

Currently it's hard coded to "islandora_object". It would make sense to make this a field in the CSV, e.g., "content_type".

Add check to make sure each CSV row has same number of columns as there are headers

Spreadsheet Editor

For bulk-editing/creating records our staff would like a spreadsheet-like experience with one row per record. Fields with multiple values (e.g. subjects) will need a delimiter. (Our staff are used to CONTENTdm's semicolon delimiter convention, although Drupal's entity autocomplete uses commas.)

There are a number of options out there, as noted by Fancy Grid's awesome-grid, so the question becomes "which one?"

A few features that we would like to see include:

"fill down" option for quickly duplicating row values
"read only" cells (so we don't accidentally change node ids when doing updates)
image display support for showing thumbnails for the item the row corresponds to
auto-complete (admittedly, we probably won't get this for free, but some sort of plugin support so we can build our own)
auto-complete/drop-down support for multiple values
grouping (for complex objects)

Early on I was keen on Handsontable, but they recently changed from an MIT license to a commercial license as of version 7.x. I've received confirmation from their sales team that we could use their "non-commercial license key", however we would need to be very clear that the workbench is not 100% open source and can't be re-released commercially without paying. We could take the Sakai approach and simply fork their 6.2.2 release (the last using MIT), or rely on Sakai's fork.

It may be better to just stick with another open source project. Options I'm considering are DataTables and x-spreadsheet.