islandora_workbench's Introduction

Islandora Workbench

A command-line tool that allows creation, updating, and deletion of Islandora content from CSV data. Islandora Workbench is an alternative to using Drupal's built-in Migrate tools for ingesting Islandora content from CSV files. Unlike the Migrate tools, Islandora Workbench can be run anywhere - it does not need to run on the Islandora server. The Migrate tools, however, are much more flexible than Islandora Workbench, and can be extended using plugins in ways that Workbench cannot.

Note that this tool is not related in any way to the Drupal contrib module called Workbench.

Features

  • Allows creation of Islandora nodes and media, updating of nodes, and deletion of nodes and media from CSV files
  • Allows creation of paged/compound content
  • Can run from anywhere - it communicates with Drupal via HTTP interfaces
  • Provides robust data validation functionality
  • Supports a variety of Drupal entity field types (text, integer, term reference, typed relation, geolocation)
  • Can provide a CSV file template based on Drupal content type
  • Can use a Google Sheet or an Excel file instead of a local CSV file as input
  • Allows assignment of Drupal vocabulary terms using term IDs, term names, or term URIs
  • Allows creation of new taxonomy terms from CSV field data, including complex and hierarchical terms
  • Allows the assignment of URL aliases
  • Allows adding alt text to images
  • Supports transmission fixity auditing for media files
  • Cross platform (written in Python, tested on Linux, Mac, and Windows)
  • Well tested
  • Well documented
  • Provides both sensible default configuration values and rich configuration options for power users
  • A companion project under development, Islandora Workbench Desktop, will add a graphical user interface that enables users not familiar or comfortable with the command line to use Workbench.
  • Can be run from within a Docker container

Documentation

Complete documentation is available.

Contributing to Workbench

Metadata, files, and Drupal configurations are, in the real world, extremely complex and varied. Testing Islandora Workbench in the wild is the best way to help make it better for everyone. If you encounter a difficulty or an unexpected behavior, or if Workbench crashes on you, reach out on the #islandoraworkbench Slack channel or open an issue in this GitHub repo.

If you have a suggestion for improving the documentation, please open an issue on this repository's queue and tag your issue "documentation".

If you want to contribute code (bug fixes, optimizations, new features, etc.), consult the developer's guide.

Using Workbench and reporting problems is the best way you can help make it better!

Current maintainer

Mark Jordan

License

License: MIT

islandora_workbench's People

Contributors

ajstanley, alxp, aoelschlager, cclauss, dependabot[bot], donrichards, gii2000, hassanelsheikha, jefferya, joecorall, manez, mjordan, noahwsmith, rosiel, ruebot, seth-shaw-asu, seth-shaw-unlv, willtp87, ysuarez

islandora_workbench's Issues

Validate typed relation values

We need to validate that the relations are present in the list configured in the field config, and we also need to validate that the target ID exists in the linked taxonomy. Otherwise, Drupal returns a 422 HTTP status code.
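
The two checks described above could be sketched as follows. This is a minimal illustration, not Workbench's implementation: the function name `validate_typed_relation` is hypothetical, and the sets of allowed relations and known term IDs are hard-coded stand-ins for data that would be fetched from Drupal.

```python
def validate_typed_relation(rel_type, target_id, allowed_relations, known_term_ids):
    """Return a list of problems for one typed relation value (empty if valid)."""
    problems = []
    # Check 1: the relation must be in the list configured for the field.
    if rel_type not in allowed_relations:
        problems.append(f"relation '{rel_type}' is not configured for this field")
    # Check 2: the target ID must exist in the linked taxonomy.
    if target_id not in known_term_ids:
        problems.append(f"term ID {target_id} does not exist in the linked taxonomy")
    return problems


# Hypothetical data; in practice both sets would come from the Drupal site.
allowed_relations = {"relators:pht", "relators:aut"}
known_term_ids = {5, 7, 33}

print(validate_typed_relation("relators:pht", 5, allowed_relations, known_term_ids))
print(validate_typed_relation("relators:xyz", 99, allowed_relations, known_term_ids))
```

Catching both problems before the request is sent would let the user fix the CSV instead of interpreting a 422 response.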

Multi-value Columns

Some fields allow multiple values (e.g. field_subject, cardinality set on field storage config).

One way to allow multiple values is to configure the column with a sub-delimiter. For example, the file is delimited with commas, and then a particular field can be delimited with semicolons. The example CSV below has a field 'field_subject' that takes a semicolon delimited list of term ids.

field_identifier,title,field_subject,file
example001,My Image,5;7;33,example001.tif

This example extended to support field lookups (#18) could take a semicolon delimited list of subject URIs:

field_identifier,title,field_subject,file
example001,My Image,http://id.loc.gov/authorities/subjects/sh85049588;http://id.loc.gov/authorities/genreForms/gf2017027249;http://sws.geonames.org/4013157/,example001.tif
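
A minimal sketch of sub-delimiter handling, using the first example CSV above. The helper name `split_subvalues` and the `SUBDELIMITER` constant are assumptions made for illustration; a real implementation would presumably read the sub-delimiter from the configuration file.

```python
import csv
import io

SUBDELIMITER = ";"  # assumed to be a configurable setting


def split_subvalues(cell, subdelimiter=SUBDELIMITER):
    """Split one CSV cell into a list of non-empty, stripped subvalues."""
    return [part.strip() for part in cell.split(subdelimiter) if part.strip()]


csv_data = """field_identifier,title,field_subject,file
example001,My Image,5;7;33,example001.tif
"""

# The csv module handles the comma-delimited columns; the sub-delimiter
# is applied afterwards, only to the cells of multi-value fields.
rows = list(csv.DictReader(io.StringIO(csv_data)))
for row in rows:
    print(row["field_identifier"], split_subvalues(row["field_subject"]))
```

Because the split happens after the CSV is parsed, the same helper works for term IDs and for URIs, as long as the values themselves do not contain the sub-delimiter.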

Figure out how to determine whether a field is required

Strangely, http://admin:islandora@localhost:8000/jsonapi/field_storage_config/field_storage_config does not appear to indicate whether a field is required or not. This information would be invaluable since if we have it, we can warn the user that they have omitted a required field. If a field is required, and it is not present in the CSV file, Drupal does not create the node.

Provide ability to delete field values

Currently there is no way to remove all values from a field. This should probably be a specific case within the "update" task. Also, in the update task, allow users to replace existing values, not append to them (as happens now). Maybe the latter could be a two-step process where the user deletes all values and then updates with new ones?

Remove crufty slicing of subvalues

Looks like I introduced unnecessary slicing of subvalues in logic that deals with cardinality of field values. Specifically, if the cardinality of a field is -1 (unlimited), I'm slicing on this value, which results in removing the last item in the list.

I will take care of this once issue-20 branch is merged.
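
The effect of slicing on an unlimited cardinality can be reproduced in a few lines. The function `apply_cardinality` is a hypothetical name for one possible fix, not the code in the issue-20 branch.

```python
def apply_cardinality(subvalues, cardinality):
    """Trim subvalues to the field's cardinality; -1 means unlimited."""
    if cardinality == -1:
        # Unlimited: keep everything, do not slice.
        return list(subvalues)
    return list(subvalues[:cardinality])


subvalues = ["5", "7", "33"]

# The behavior described above: slicing on -1 silently drops the last item.
print(subvalues[:-1])

# Treating -1 as a special case keeps every value.
print(apply_cardinality(subvalues, -1))
print(apply_cardinality(subvalues, 2))
```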

Allow different ways to "update" field values

Updates are complicated. Currently, updates in workbench "preserve any values in the fields, they don't replace the values." It would be useful to allow for this type of update, as well as updates that replace existing field values with ones in the input data, delete field values (#39), and remove a specific value from a multivalued field (#40).

It would also be useful to allow users to specify at the row level, rather than for all rows in the input CSV, which of these types of "update" is intended. One option (not necessarily the best) is to have the user provide a flag as the first character in the field itself that indicates which type of update they want. Typed relation fields already use a structured value using : to separate the parts of the value, so an approach like that might be applicable. For example, the user specifies an a for "append input values", an r for "replace all values with input", a d for "delete all existing values or a specific value" like this:

field_my_field
a:additional value
d:
d:bad_value
r:new value

This would have to work with multivalued fields, and also with typed relation fields, which already use the :.

If the user didn't want row-level granularity for a given job (i.e., all update operations were of the same kind), the flag should be configurable in the config file.
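
One way the flag proposal could be parsed is sketched below. Everything here is hypothetical: the function name `parse_update_cell`, the `UPDATE_MODES` mapping, and the `default_flag` parameter standing in for a job-wide setting from the config file.

```python
UPDATE_MODES = {"a": "append", "r": "replace", "d": "delete"}


def parse_update_cell(cell, default_flag=None):
    """Return (mode, value) for a cell such as 'a:additional value' or 'd:'."""
    # Split on the first colon only, so typed relation values that
    # contain further colons stay intact in the remainder.
    flag, sep, rest = cell.partition(":")
    if sep and flag in UPDATE_MODES:
        return UPDATE_MODES[flag], rest
    if default_flag is not None:
        # No row-level flag: fall back to the job-wide setting.
        return UPDATE_MODES[default_flag], cell
    raise ValueError(f"No update flag in {cell!r} and no default configured")


for cell in ["a:additional value", "d:", "d:bad_value", "r:new value", "a:relators:pht:5"]:
    print(parse_update_cell(cell))
```

Splitting on the first colon only is what lets the flag coexist with typed relation values. A literal value that itself begins with `a:`, `r:`, or `d:` would still be ambiguous under this scheme.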

Image Resizing

It is useful to resize images (if that is what we are uploading) locally since 1) desktop processor time is essentially free and 2) we can make thumbnails for the GUI editor before upload.

We could use ImageMagick, but users would need to install it on their machine before installing the workbench. (Attempts to find a way to bundle it natively have not worked.)

We could use the Python Pillow library (the original PIL doesn't have a Python 3 release). Presumably we could wrap calls there like we do for the ingester, although I don't know the implications for building the distribution binaries.

Another option is the Node.js Sharp module which, in some testing, performs faster than a JavaScript wrapper around ImageMagick. The benefit of this is we know it can integrate with an Electron app. The downside is that it doesn't appear to support JPEG 2000.

Need a functional test framework

I haven't evaluated the popular test frameworks for Python, but given the growing complexity of workbench's code and the various combinations of field types, cardinality, etc., small changes could easily lead to regression errors. Since the main functionality of this application is interacting with Drupal via REST, any framework we use would need to be able to support mock response objects, etc. (or whatever Python calls those).

So far while hacking on workbench, I have always had a live CLAW Vagrant running. I know it's not good practice to rely on external systems in automated testing, but I'd be perfectly happy to require a vagrant as part of a test framework. In fact, our functional tests in Islandora 7.x worked against a live Islandora. That would complicate CI however.

The testing pattern I'm imagining is:

  1. within a test, workbench makes request to a live Islandora at localhost using fixture data. For POSTing new nodes, this means a new node is created on the vagrant.
  2. response is evaluated, or a subsequent GET is issued to get data to evaluate test
  3. if the test is successful, the node is then DELETED or updated (and then deleted) if necessary to keep the local Islandora instance clean

Again, I know purists will say "you should always mock up your responses" but I'm more of a pragmatist than a purist.

Allow fields to be applied to all new objects by using a single configuration value

If a value in a CSV field is the same for all new nodes (task = create), it might be useful to allow this field:value pair to be defined once in the configuration file. Some examples are #22 and #12. The logic would be: if a field name:value pair is present in the config file, apply that value to all new nodes (task = create). We would probably need to create a nested map in the config file, e.g.:

node_fields:
  field_member_of: 12
  content_type: islandora_object

In this example, all of the newly created nodes would be islandora_object nodes with a field_member_of value of 12.
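
A minimal sketch of how such a map could be applied to each CSV row. The config is shown as an already-parsed Python dictionary, and `apply_node_field_defaults` is a hypothetical helper. The rule that a non-empty CSV value overrides the configured one is an assumption; the proposal above does not say which should win.

```python
def apply_node_field_defaults(row, node_fields):
    """Return a copy of row with the configured field values filled in."""
    merged = dict(node_fields)
    # Assumption: a non-empty value in the CSV row takes precedence.
    merged.update({k: v for k, v in row.items() if v not in (None, "")})
    return merged


# Stand-in for the parsed configuration file.
config = {
    "task": "create",
    "node_fields": {"field_member_of": 12, "content_type": "islandora_object"},
}

row = {"title": "My Image", "file": "example001.tif"}
print(apply_node_field_defaults(row, config["node_fields"]))
```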

Relocate check for 'location' header

Currently, the check for the location HTTP response header is outside of the block that checks for a successful create (201) response:

node_response = issue_request(config, 'POST', node_endpoint, node_headers, node, None)
node_uri = node_response.headers['location']
if node_response.status_code == 201:
    print("Node for '" + row['title'] +
          "' created at " + node_uri + ".")
    logging.info("Node for %s created at %s.", row['title'], node_uri)

The check for 'location' should go within this block.
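
A sketch of the relocated check follows. To keep it self-contained, it uses stand-in response objects instead of a real `issue_request` call, and wraps the logic in a hypothetical function, `report_node_creation`. The error-logging branch is also an assumption.

```python
import logging
from types import SimpleNamespace


def report_node_creation(node_response, title):
    """Return the new node's URI on a 201 response, otherwise None."""
    if node_response.status_code == 201:
        # The 'location' header is only read once we know the create succeeded.
        node_uri = node_response.headers['location']
        print("Node for '" + title + "' created at " + node_uri + ".")
        logging.info("Node for %s created at %s.", title, node_uri)
        return node_uri
    logging.error("Node for %s not created, HTTP response code was %s.",
                  title, node_response.status_code)
    return None


# Stand-ins for HTTP responses (hypothetical test data).
created = SimpleNamespace(status_code=201,
                          headers={'location': 'http://localhost:8000/node/52'})
failed = SimpleNamespace(status_code=422, headers={})

report_node_creation(created, 'Small boats in Havana Harbour')
report_node_creation(failed, 'Small boats in Havana Harbour')
```

With the header lookup inside the 201 branch, a failed create no longer raises a `KeyError` on the missing header.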

Show media URL in output

Since we can get the URL of a newly created media in the HTTP response headers (e.g. Location: http://localhost:8000/media/4), we should show this to the user and also log it. So instead of:

Node for 'Small boats in Havana Harbour' created at http://localhost:8000/node/52.
+File media for IMG_1410.tif created.

we could show:

Node for 'Small boats in Havana Harbour' created at http://localhost:8000/node/52.
+File media for IMG_1410.tif created at http://localhost:8000/media/4.

Node reference lookup field

We may have spreadsheet rows that need to reference other rows, so we need to be able to provide a look-up value.

For example, I have a parent record and two child records, all of which use the field_identifier column. I also have a field_member_of column (could be field_part_of, etc.). The child records don't have a node id for the parent record (because it hasn't been created yet) but they do have its identifier value.

field_identifier,field_member_of,title,file
example001,,Parent record,
example001-001,example001,Child one,example001-001.tif
example001-002,example001,Child two,example001-002.tif

We should be able to tell the workbench that a certain column (e.g. field_member_of) should look up the node id based on a configured node field (e.g. field_identifier).
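
The lookup could be sketched as an in-memory map from identifier to node id, built as rows are created. This assumes that parent rows appear before their children in the CSV. The `create_node` callback is a hypothetical stand-in for the request that creates the node in Drupal and returns its node id.

```python
import csv
import io
import itertools

csv_data = """field_identifier,field_member_of,title,file
example001,,Parent record,
example001-001,example001,Child one,example001-001.tif
example001-002,example001,Child two,example001-002.tif
"""


def resolve_parent_references(rows, create_node,
                              id_field="field_identifier",
                              ref_field="field_member_of"):
    """Replace identifier values in ref_field with node ids, creating nodes in order."""
    identifier_to_nid = {}
    resolved = []
    for row in rows:
        row = dict(row)
        parent_identifier = row[ref_field]
        if parent_identifier:
            # Raises KeyError if the parent row has not been created yet.
            row[ref_field] = identifier_to_nid[parent_identifier]
        identifier_to_nid[row[id_field]] = create_node(row)
        resolved.append(row)
    return resolved


# Fake node creation that hands out sequential node ids starting at 100.
counter = itertools.count(100)
rows = list(csv.DictReader(io.StringIO(csv_data)))
resolved = resolve_parent_references(rows, lambda row: next(counter))
for row in resolved:
    print(row["field_identifier"], row["field_member_of"])
```

An in-memory map only covers parents created in the same run. Looking up existing nodes by field_identifier would require a query against Drupal instead.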

Support Linked Agents (TypedRelation Fields)

Islandora Defaults uses the TypedRelations FieldType for field_linked_agent. This means that a valid value for a field_linked_agent field includes both the relator (e.g. 'relators:pht' for a Photographer) and the Photographer's term id.

Also, assuming we implement multi-valued columns (#19), we could have a cell listing the relator code/term id pairs delimited by semi-colons.
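
A sketch of parsing such a cell is shown below. The cell format (`relators:pht:5`, i.e. the relator followed by a colon and the term id) and the dictionary keys `rel_type` and `target_id` are assumptions made for illustration. The function name `parse_linked_agents` is hypothetical.

```python
def parse_linked_agents(cell, subdelimiter=";"):
    """Parse 'relators:pht:5;relators:aut:7' into a list of relator/term id pairs."""
    agents = []
    for subvalue in cell.split(subdelimiter):
        subvalue = subvalue.strip()
        if not subvalue:
            continue
        # Split on the last colon: the relator itself contains a colon.
        rel_type, sep, target_id = subvalue.rpartition(":")
        if not sep or not rel_type or not target_id.isdigit():
            raise ValueError(f"Cannot parse typed relation value {subvalue!r}")
        agents.append({"rel_type": rel_type, "target_id": int(target_id)})
    return agents


print(parse_linked_agents("relators:pht:5;relators:aut:7"))
```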

Spreadsheet Editor

For bulk-editing/creating records our staff would like a spreadsheet-like experience with one row per record. Fields with multiple values (e.g. subjects) will need a delimiter. (Our staff are used to CONTENTdm's semicolon delimiter convention, although Drupal's entity autocomplete uses commas.)

There are a number of options out there, as noted by Fancy Grid's awesome-grid, so the question becomes "which one?"

A few features that we would like to see include:

  • "fill down" option for quickly duplicating row values
  • "read only" cells (so we don't accidentally change node ids when doing updates)
  • image display support for showing thumbnails for the item the row corresponds to
  • auto-complete (admittedly, we probably won't get this for free, but some sort of plugin support so we can build our own)
  • auto-complete/drop-down support for multiple values
  • grouping (for complex objects)

Early on I was keen on Handsontable, but they recently changed from an MIT license to a commercial license as of version 7.x. I've received confirmation from their sales team that we could use their "non-commercial license key", however we would need to be very clear that the workbench is not 100% open source and can't be re-released commercially without paying. We could take the Sakai approach and simply fork their 6.2.2 release (the last using MIT), or rely on Sakai's fork.

It may be better to just stick with another open source project. Options I'm considering are DataTables and x-spreadsheet.
