mjordan / islandora_workbench
A command-line tool for managing content in an Islandora 2 repository
License: MIT License
All values should be stripped of leading and trailing white space, newlines, etc.
If we can inspect the fields that a content type has configured, we can validate that the columns in the input file match machine field names, and we can also dynamically create requests for string, taxonomy term, and linked references.
Delimiter, quote character, etc. Provide sensible defaults.
Currently only title and description are allowed.
It is useful to resize images (if that is what we are uploading) locally since 1) desktop processor time is essentially free and 2) we can make thumbnails for the GUI editor before upload.
We could use ImageMagick, but users would need to install it on their machine before the workbench is installed. (Searching for ways to include it natively has not worked.)
We could use the Python Pillow library (the original PIL doesn't have a Python 3 release). Presumably we could wrap calls there like we do for the ingester, although I don't know the implications for building the distribution binaries.
Another option is the Node.js Sharp module which, in some testing, performs faster than a JavaScript wrapper around ImageMagick. The benefit of this is that we know it can integrate with an Electron app. The downside is that it doesn't appear to support JPEG 2000.
Now that we can introspect fields, we should offer a way for the user to validate their input data, e.g., all the columns in the CSV file exist, the field types are consistent with the data in the columns, and the files that are referenced exist.
Now with #6 coming along, we can inspect each field to see if it is a string, entity reference, etc. We should also probably know if the field is multivalued or not.
To make sure all the required options, for each task, are present.
This would allow centralized error handling, etc.
Some fields allow multiple values (e.g. field_subject, cardinality set on field storage config).
One way to allow multiple values is to configure the column with a sub-delimiter. For example, the file is delimited with commas, and then a particular field can be delimited with semicolons. The example CSV below has a field 'field_subject' that takes a semicolon delimited list of term ids.
field_identifier,title,field_subject,file
example001,My Image,5;7;33,example001.tif
This example extended to support field lookups (#18) could take a semicolon delimited list of subject URIs:
field_identifier,title,field_subject,file
example001,My Image,http://id.loc.gov/authorities/subjects/sh85049588;http://id.loc.gov/authorities/genreForms/gf2017027249;http://sws.geonames.org/4013157/,example001.tif
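The sub-delimiter idea above can be sketched in a few lines, assuming a configurable `subdelimiter` option (the name is hypothetical):

```python
def split_subvalues(cell, subdelimiter=";"):
    """Split a CSV cell into multiple field values using the sub-delimiter.
    Empty cells yield an empty list; whitespace around subvalues is stripped."""
    if not cell:
        return []
    return [v.strip() for v in cell.split(subdelimiter)]
```

The same function handles both term IDs and URIs, since it treats subvalues as opaque strings.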
Since we can get the URL of a newly created media in the HTTP response headers (e.g. Location: http://localhost:8000/media/4), we should show this to the user and also log it. So instead of:
Node for 'Small boats in Havana Harbour' created at http://localhost:8000/node/52.
+File media for IMG_1410.tif created.
we could show:
Node for 'Small boats in Havana Harbour' created at http://localhost:8000/node/52.
+File media for IMG_1410.tif created at http://localhost:8000/media/4.
Currently the content type is hard-coded to "islandora_object". It would make sense to make this a field in the CSV, e.g., "content_type".
Looks like I introduced unnecessary slicing of subvalues in logic that deals with cardinality of field values. Specifically, if the cardinality of a field is -1 (unlimited), I'm slicing on this value, which results in removing the last item in the list.
I will take care of this once issue-20 branch is merged.
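The bug and its fix in miniature: slicing a list with `[:-1]` drops the last item, so -1 (unlimited) must be special-cased before slicing. A sketch, not the actual Workbench code:

```python
def prepare_field_values(subvalues, cardinality):
    """Apply a field's cardinality to a list of subvalues.
    -1 means unlimited, so no slicing at all; otherwise keep at most
    `cardinality` values. Slicing with -1 directly (subvalues[:-1])
    would incorrectly remove the last item."""
    if cardinality == -1:
        return subvalues
    return subvalues[:cardinality]
```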
If a value in a CSV field is the same for all new nodes (task = create), it might be useful to allow this field:value pair to be defined once in the configuration file. Some examples are #22, and #12. The logic would be if a field name: value pair is present in the config file, apply that value to all new nodes (task = create). We would probably need to create a nested map in the config file, e.g.:
node_fields:
field_member_of: 12
content_type: islandora_object
In this example, all of the newly created nodes would be islandora_object nodes with a field_member_of
value of 12.
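Applying that nested map could look like this (a sketch; the `node_fields` key mirrors the proposed config structure, and the precedence rule — CSV values win over config defaults — is an assumption):

```python
def apply_node_field_defaults(row, config):
    """Merge config-level field:value defaults into a CSV row for the
    'create' task. Non-empty values already in the row take precedence
    over the config-file defaults."""
    merged = dict(config.get("node_fields", {}))
    merged.update({k: v for k, v in row.items() if v not in ("", None)})
    return merged
```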
--check doesn't detect that some rows in the CSV don't have the same number of columns as there are CSV headers.
Updates are complicated. Currently, updates in workbench "preserve any values in the fields, they don't replace the values." It would be useful to allow for this type of update, as well as updates that replace existing field values with ones in the input data, delete field values (#39), and remove a specific value from a multivalued field (#40).
It would also be useful to allow users to specify at the row level, rather than for all rows in the input CSV, which of these types of "update" is intended. One option (not necessarily the best) is to have the user provide a flag as the first character in the field itself that indicates which type of update they want. Typed relation fields already use a structured value with : to separate the parts of the value, so an approach like that might be applicable. For example, the user specifies an a for "append input values", an r for "replace all values with input", or a d for "delete all existing values or a specific value", like this:
field_my_field
a:additional value
d:
d:bad_value
r:new value
This would have to work with multivalued fields, and also with typed relation fields, which already use the : separator.
If the user didn't want row-level granularity for a given job (i.e., all update operations were of the same kind), the flag should be configurable in the config file.
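Parsing the proposed flag prefix might look like this (a sketch; the flag letters come from the proposal above, and the fallback to a job-level default is the configurable behavior described):

```python
UPDATE_FLAGS = {"a": "append", "r": "replace", "d": "delete"}

def parse_update_cell(cell, default="append"):
    """Split an optional one-letter update flag off the front of a cell,
    returning (mode, value). Cells without a recognized flag fall back to
    the job-level default from the config file. Note the ambiguity this
    scheme creates for plain values that happen to start with 'a:', 'r:',
    or 'd:' -- one reason it's "not necessarily the best" option."""
    if len(cell) >= 2 and cell[1] == ":" and cell[0] in UPDATE_FLAGS:
        return UPDATE_FLAGS[cell[0]], cell[2:]
    return default, cell
```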
For bulk-editing/creating records, our staff would like a spreadsheet-like experience with one row per record. Fields with multiple values (e.g. subjects) will need a delimiter. (Our staff are used to CONTENTdm's semicolon delimiter convention, although Drupal's entity autocomplete uses commas.)
There are a number of options out there, as noted by Fancy Grid's awesome-grid, so the question becomes "which one?"
A few features that we would like to see include:
Early on I was keen on Handsontable, but they recently changed from an MIT license to a commercial license as of version 7.x. I've received confirmation from their sales team that we could use their "non-commercial license key", however we would need to be very clear that the workbench is not 100% open source and can't be re-released commercially without paying. We could take the Sakai approach and simply fork their 6.2.2 release (the last using MIT), or rely on Sakai's fork.
It may be better to just stick with another open source project. Options I'm considering are DataTables and x-spreadsheet.
Just an idea, but related to Islandora/documentation#867:
Workbench could verify transmission fixity by generating a digest for a file prior to PUTing it, then, once the file has been ingested, getting its checksum via the technique from mjordan/islandora_riprap#33 as implemented in the Islandora Workbench Integration module.
Be sure to test on Windows.
From an input file containing node IDs or URLs.
Currently, the check for the location HTTP response header is outside of the block that checks for a successful create (201) response:
node_response = issue_request(config, 'POST', node_endpoint, node_headers, node, None)
node_uri = node_response.headers['location']
if node_response.status_code == 201:
    print("Node for '" + row['title'] +
          "' created at " + node_uri + ".")
    logging.info("Node for %s created at %s.", row['title'], node_uri)
The check for 'location' should go within this block.
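A corrected sketch, wrapped in a function so it can be tested in isolation (the function name and the error-logging branch are additions; the response/row names follow the snippet above):

```python
import logging

def report_node_creation(node_response, row):
    """Only read the 'location' header after confirming a 201, so a failed
    create doesn't raise KeyError on a missing header. Returns the new
    node URI, or None if creation failed."""
    if node_response.status_code == 201:
        node_uri = node_response.headers['location']
        print("Node for '" + row['title'] + "' created at " + node_uri + ".")
        logging.info("Node for %s created at %s.", row['title'], node_uri)
        return node_uri
    logging.error("Node creation failed for %s (HTTP %s).",
                  row['title'], node_response.status_code)
    return None
```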
Getting a KeyError exception when running the update task. Working on it.
I haven't evaluated the popular test frameworks for Python, but given the growing complexity of workbench's code and the various combinations of field types, cardinality, etc., small changes could easily lead to regression errors. Since the main functionality of this application is interacting with Drupal via REST, any framework we use would need to be able to support mock response objects, etc. (or whatever Python calls those).
So far while hacking on workbench, I have always had a live CLAW Vagrant running. I know it's not good practice to rely on external systems in automated testing, but I'd be perfectly happy to require a vagrant as part of a test framework. In fact, our functional tests in Islandora 7.x worked against a live Islandora. That would complicate CI however.
The testing pattern I'm imagining is:
a GET is issued to get data to evaluate
test data is DELETED or updated (and then deleted) if necessary to keep the local Islandora instance clean
Again, I know purists will say "you should always mock up your responses" but I'm more of a pragmatist than a purist.
We need to validate that the relations are present in the list configured in the field config, and we also need to validate that the target ID exists in the linked taxonomy. Otherwise, Drupal returns a 422 HTTP status code.
We should allow for empty file field values, so that nodes can be created without attached media.
@kayakr points out over at Islandora/documentation#1172 that there is a Drupal contrib module for tus. There is also a tus client for Python.
To complement 'add_media'. Should be done in conjunction with #38 so we can reuse the delete code.
workbench is now over 500 lines of code and growing. It should be refactored so it is easier to read and develop in, and so that automated tests can be applied.
Most fields defined by Islandora Defaults have a cardinality of either 1 or -1 (unlimited). If a user attempts to add a number of values to a field (see #19) that exceeds the field's cardinality, what happens? Does Drupal throw an exception? Are the excess values ignored?
In any case, we probably want to log the fact that the number of values being added exceeds the maximum allowed.
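Whatever Drupal does, truncating locally and logging puts the behavior under our control. A sketch (function name is hypothetical):

```python
import logging

def enforce_cardinality(field_name, values, cardinality):
    """Truncate values that exceed a field's cardinality and log the fact,
    rather than relying on Drupal to reject or silently drop them.
    -1 means unlimited."""
    if cardinality == -1 or len(values) <= cardinality:
        return values
    logging.warning(
        "%s: %d values provided but cardinality is %d; extra values ignored.",
        field_name, len(values), cardinality)
    return values[:cardinality]
```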
Updating is complex.
We may have spreadsheet rows that need to reference other rows, so we need to be able to provide a look-up value.
For example, I have a parent record and two child records, all of which use the field_identifier column. I also have a field_member_of column (could be field_part_of, etc.). The child records don't have a node id for the parent record (because it hasn't been created yet), but the parent does have an identifier value.
field_identifier,field_member_of,title,file
example001,,Parent record,
example001-001,example001,Child one,example001-001.tif
example001-002,example001,Child two,example001-002.tif
We should be able to tell the workbench that a certain column (e.g. field_member_of) should look up the node id based on a configured node field (e.g. field_identifier).
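The lookup could work like this, assuming parents appear before their children in the CSV and the caller records each node id as Drupal returns it (the `node_id` key and sequential ids below are stand-ins for illustration):

```python
def resolve_member_of(rows):
    """Resolve field_member_of identifier references to node ids as rows
    are processed in order. Each row's field_identifier is mapped to its
    (simulated) node id so later rows can look the parent up."""
    id_to_nid = {}
    for nid, row in enumerate(rows, start=1):  # stand-in for Drupal's node ids
        row["node_id"] = nid
        id_to_nid[row["field_identifier"]] = nid
        parent = row.get("field_member_of", "")
        if parent:
            row["field_member_of"] = id_to_nid[parent]
    return rows
```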
delimiter defaults to , if not specified in the config file.
Currently there is no way to remove all values from a field. This should probably be a specific case within the "update" task. Also, in the update task, allow users to replace existing values, not append to them (as happens now). Maybe the latter could be a two-step process where the user deletes all values and then updates with new ones?
Islandora Defaults uses the TypedRelations FieldType for field_linked_agent. This means that a valid value for a field_linked_agent field includes both the relator (e.g. 'relators:pht' for a Photographer) and the Photographer's term id.
Also, assuming we implement multi-valued columns (#19), we could have a cell listing the relator code/term id pairs delimited by semi-colons.
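Parsing such a cell might look like this (a sketch; the namespace:code:target_id shape is an assumption based on the TypedRelation field's structured value, and the output dict keys are illustrative, not Drupal's exact field item keys):

```python
def parse_typed_relations(cell, subdelimiter=";"):
    """Parse a semicolon-delimited list of typed relation values such as
    'relators:pht:27;relators:art:30' into relator/target pairs."""
    values = []
    for chunk in cell.split(subdelimiter):
        namespace, code, target_id = chunk.strip().split(":")
        values.append({"rel_type": namespace + ":" + code,
                       "target_id": int(target_id)})
    return values
```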
Strangely, http://admin:islandora@localhost:8000/jsonapi/field_storage_config/field_storage_config does not appear to indicate whether a field is required or not. This information would be invaluable since, if we have it, we can warn the user that they have omitted a required field. If a field is required and it is not present in the CSV file, Drupal does not create the node.
It is duplicated in the 'create' and 'add_media' tasks. Could exist once in workbench_utils.py.
This comes from #37.
In the update task, for fields of non-1 and non-unlimited cardinality, we slice the subvalues to match the cardinality. What we should be doing is making sure the existing values remain, and then slice the rest to get the new values to add.
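In other words: preserve the existing values, then fill only the remaining room with new ones. A sketch of the intended behavior (function name is hypothetical):

```python
def merge_update_values(existing, new, cardinality):
    """Keep existing field values and append only as many new values as
    the field's remaining cardinality allows. -1 means unlimited, so
    everything is appended."""
    if cardinality == -1:
        return existing + new
    room = cardinality - len(existing)
    return existing + new[:max(room, 0)]
```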
https://github.com/requests/requests-threads
As @dannylamb said in IRC, "managing the pool with 1 thread would be good to do, though. then you're not doing all the bookkeeping for managing the batch of requests"