mjordan / islandora_workbench
A command-line tool for managing content in an Islandora 2 repository
License: MIT License
All values should be stripped of leading and trailing white space, newlines, etc.
If we can inspect the fields that a content type has configured, we can validate that the columns in the input file match machine field names, and we can also dynamically create requests for string, taxonomy term, and linked references.
Delimiter, quote character, etc. Provide sensible defaults.
Currently only title and description are allowed.
It is useful to resize images (if that is what we are uploading) locally since 1) desktop processor time is essentially free and 2) we can make thumbnails for the GUI editor before upload.
We could use ImageMagick, but users would need to install it on their machine before the workbench is installed. (Searching for ways to include it natively has not worked.)
We could use the Python Pillow library (the original PIL doesn't have a Python 3 release). Presumably we could wrap calls there like we do for the ingester, although I don't know the implications for building the distribution binaries.
Another option is the Node.js Sharp module which, in some testing, performs faster than a JavaScript wrapper around ImageMagick. The benefit of this is that we know it can integrate with an Electron app. The downside is that it doesn't appear to support JPEG 2000.
Now that we can introspect fields, we should offer a way for the user to validate their input data, e.g., all the columns in the CSV file exist, the field types are consistent with the data in the columns, and the files that are referenced exist.
Now with #6 coming along, we can inspect each field to see if it is a string, entity reference, etc. We should also probably know if the field is multivalued or not.
To make sure all the required options, for each task, are present.
This would allow centralized error handling, etc.
Some fields allow multiple values (e.g. field_subject, cardinality set on field storage config).
One way to allow multiple values is to configure the column with a sub-delimiter. For example, the file is delimited with commas, and then a particular field can be delimited with semicolons. The example CSV below has a field 'field_subject' that takes a semicolon delimited list of term ids.
field_identifier,title,field_subject,file
example001,My Image,5;7;33,example001.tif
This example extended to support field lookups (#18) could take a semicolon delimited list of subject URIs:
field_identifier,title,field_subject,file
example001,My Image,http://id.loc.gov/authorities/subjects/sh85049588;http://id.loc.gov/authorities/genreForms/gf2017027249;http://sws.geonames.org/4013157/,example001.tif
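The sub-delimiter idea above can be sketched in a few lines, assuming a configurable `subdelimiter` option (the name is hypothetical):

```python
def split_subvalues(cell, subdelimiter=";"):
    """Split a CSV cell into multiple field values using the sub-delimiter.
    Empty cells yield an empty list; whitespace around subvalues is stripped."""
    if not cell:
        return []
    return [v.strip() for v in cell.split(subdelimiter)]
```

The same function handles both term IDs and URIs, since it treats subvalues as opaque strings.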
Since we can get the URL of a newly created media in the HTTP response headers (e.g. Location: http://localhost:8000/media/4), we should show this to the user and also log it. So instead of:
Node for 'Small boats in Havana Harbour' created at http://localhost:8000/node/52.
+File media for IMG_1410.tif created.
we could show:
Node for 'Small boats in Havana Harbour' created at http://localhost:8000/node/52.
+File media for IMG_1410.tif created at http://localhost:8000/media/4.
Currently the content type is hard-coded to "islandora_object". It would make sense to make this a field in the CSV, e.g., "content_type".
Looks like I introduced unnecessary slicing of subvalues in logic that deals with cardinality of field values. Specifically, if the cardinality of a field is -1 (unlimited), I'm slicing on this value, which results in removing the last item in the list.
I will take care of this once issue-20 branch is merged.
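The bug and its fix in miniature: slicing a list with `[:-1]` drops the last item, so -1 (unlimited) must be special-cased before slicing. A sketch, not the actual Workbench code:

```python
def prepare_field_values(subvalues, cardinality):
    """Apply a field's cardinality to a list of subvalues.
    -1 means unlimited, so no slicing at all; otherwise keep at most
    `cardinality` values. Slicing with -1 directly (subvalues[:-1])
    would incorrectly remove the last item."""
    if cardinality == -1:
        return subvalues
    return subvalues[:cardinality]
```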
If a value in a CSV field is the same for all new nodes (task = create), it might be useful to allow this field:value pair to be defined once in the configuration file. Some examples are #22, and #12. The logic would be if a field name: value pair is present in the config file, apply that value to all new nodes (task = create). We would probably need to create a nested map in the config file, e.g.:
node_fields:
field_member_of: 12
content_type: islandora_object
In this example, all of the newly created nodes would be islandora_object nodes with a field_member_of
value of 12.
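Applying that nested map could look like this (a sketch; the `node_fields` key mirrors the proposed config structure, and the precedence rule — CSV values win over config defaults — is an assumption):

```python
def apply_node_field_defaults(row, config):
    """Merge config-level field:value defaults into a CSV row for the
    'create' task. Non-empty values already in the row take precedence
    over the config-file defaults."""
    merged = dict(config.get("node_fields", {}))
    merged.update({k: v for k, v in row.items() if v not in ("", None)})
    return merged
```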
--check doesn't detect that some rows in the CSV don't have the same number of columns as there are CSV headers.
Updates are complicated. Currently, updates in workbench "preserve any values in the fields, they don't replace the values." It would be useful to allow for this type of update, as well as updates that replace existing field values with ones in the input data, delete field values (#39), and remove a specific value from a multivalued field (#40).
It would also be useful to allow users to specify at the row level, rather than for all rows in the input CSV, which of these types of "update" is intended. One option (not necessarily the best) is to have the user provide a flag as the first character in the field itself that indicates which type of update they want. Typed relation fields already use a structured value with : to separate the parts of the value, so an approach like that might be applicable. For example, the user specifies an a for "append input values", an r for "replace all values with input", or a d for "delete all existing values or a specific value", like this:
field_my_field
a:additional value
d:
d:bad_value
r:new value
This would have to work with multivalued fields, and also with typed relation fields, which already use the : separator.
If the user didn't want row-level granularity for a given job (i.e., all update operations were of the same kind), the flag should be configurable in the config file.
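Parsing the proposed flag prefix might look like this (a sketch; the flag letters come from the proposal above, and the fallback to a job-level default is the configurable behavior described):

```python
UPDATE_FLAGS = {"a": "append", "r": "replace", "d": "delete"}

def parse_update_cell(cell, default="append"):
    """Split an optional one-letter update flag off the front of a cell,
    returning (mode, value). Cells without a recognized flag fall back to
    the job-level default from the config file. Note the ambiguity this
    scheme creates for plain values that happen to start with 'a:', 'r:',
    or 'd:' -- one reason it's "not necessarily the best" option."""
    if len(cell) >= 2 and cell[1] == ":" and cell[0] in UPDATE_FLAGS:
        return UPDATE_FLAGS[cell[0]], cell[2:]
    return default, cell
```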
For bulk-editing/creating records, our staff would like a spreadsheet-like experience with one row per record. Fields with multiple values (e.g. subjects) will need a delimiter. (Our staff are used to CONTENTdm's semicolon delimiter convention, although Drupal's entity autocomplete uses commas.)
There are a number of options out there, as noted by Fancy Grid's awesome-grid, so the question becomes "which one?"
A few features that we would like to see include:
Early on I was keen on Handsontable, but they recently changed from an MIT license to a commercial license as of version 7.x. I've received confirmation from their sales team that we could use their "non-commercial license key", however we would need to be very clear that the workbench is not 100% open source and can't be re-released commercially without paying. We could take the Sakai approach and simply fork their 6.2.2 release (the last using MIT), or rely on Sakai's fork.
It may be better to just stick with another open source project. Options I'm considering are DataTables and x-spreadsheet.
Just an idea, but related to Islandora/documentation#867:
Workbench could verify transmission fixity by generating a digest for a file prior to PUTing it, then, once the file has been ingested, getting its checksum via the technique from mjordan/islandora_riprap#33 as implemented in the Islandora Workbench Integration module.
Be sure to test on Windows.
From an input file containing node IDs or URLs.
Currently, the check for the location HTTP response header is outside of the block that checks for a successful create (201) response:
node_response = issue_request(config, 'POST', node_endpoint, node_headers, node, None)
node_uri = node_response.headers['location']
if node_response.status_code == 201:
    print("Node for '" + row['title'] +
          "' created at " + node_uri + ".")
    logging.info("Node for %s created at %s.", row['title'], node_uri)
The check for 'location' should go within this block.
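A corrected sketch, wrapped in a function so it can be tested in isolation (the function name and the error-logging branch are additions; the response/row names follow the snippet above):

```python
import logging

def report_node_creation(node_response, row):
    """Only read the 'location' header after confirming a 201, so a failed
    create doesn't raise KeyError on a missing header. Returns the new
    node URI, or None if creation failed."""
    if node_response.status_code == 201:
        node_uri = node_response.headers['location']
        print("Node for '" + row['title'] + "' created at " + node_uri + ".")
        logging.info("Node for %s created at %s.", row['title'], node_uri)
        return node_uri
    logging.error("Node creation failed for %s (HTTP %s).",
                  row['title'], node_response.status_code)
    return None
```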
Getting a KeyError exception when running the update task. Working on it.
I haven't evaluated the popular test frameworks for Python, but given the growing complexity of workbench's code and the various combinations of field types, cardinality, etc., small changes could easily lead to regression errors. Since the main functionality of this application is interacting with Drupal via REST, any framework we use would need to be able to support mock response objects, etc. (or whatever Python calls those).
So far while hacking on workbench, I have always had a live CLAW Vagrant running. I know it's not good practice to rely on external systems in automated testing, but I'd be perfectly happy to require a vagrant as part of a test framework. In fact, our functional tests in Islandora 7.x worked against a live Islandora. That would complicate CI however.
The testing pattern I'm imagining is:
a GET is issued to get data to evaluate
test data is DELETED or updated (and then deleted) if necessary to keep the local Islandora instance clean
Again, I know purists will say "you should always mock up your responses" but I'm more of a pragmatist than a purist.
We need to validate that the relations are present in the list configured in the field config, and we also need to validate that the target ID exists in the linked taxonomy. Otherwise, Drupal returns a 422 HTTP status code.
We should allow for empty file field values, so that nodes can be created without attached media.
@kayakr points out over at Islandora/documentation#1172 that there is a Drupal contrib module for tus. There is also a tus client for Python.
To complement 'add_media'. Should be done in conjunction with #38 so we can reuse the delete code.
workbench is now over 500 lines of code and growing. It should be refactored so it is easier to read and develop in, and so that automated tests can be applied.
Most fields defined by Islandora Defaults have a cardinality of either 1 or -1 (unlimited). If a user attempts to add a number of values to a field (see #19) that exceeds the field's cardinality, what happens? Does Drupal throw an exception? Are the excess values ignored?
In any case, we probably want to log the fact that the number of values being added exceeds the maximum allowed.
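Whatever Drupal does, truncating locally and logging puts the behavior under our control. A sketch (function name is hypothetical):

```python
import logging

def enforce_cardinality(field_name, values, cardinality):
    """Truncate values that exceed a field's cardinality and log the fact,
    rather than relying on Drupal to reject or silently drop them.
    -1 means unlimited."""
    if cardinality == -1 or len(values) <= cardinality:
        return values
    logging.warning(
        "%s: %d values provided but cardinality is %d; extra values ignored.",
        field_name, len(values), cardinality)
    return values[:cardinality]
```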
Updating is complex.
We may have spreadsheet rows that need to reference other rows, so we need to be able to provide a look-up value.
For example, I have a parent record and two child records, all of which use the field_identifier column. I also have a field_member_of column (could be field_part_of, etc.). The child records don't have a node id for the parent record (because it hasn't been created yet), but the parent does have an identifier value.
field_identifier,field_member_of,title,file
example001,,Parent record,
example001-001,example001,Child one,example001-001.tif
example001-002,example001,Child two,example001-002.tif
We should be able to tell the workbench that a certain column (e.g. field_member_of) should look up the node id based on a configured node field (e.g. field_identifier).
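The lookup could work like this, assuming parents appear before their children in the CSV and the caller records each node id as Drupal returns it (the `node_id` key and sequential ids below are stand-ins for illustration):

```python
def resolve_member_of(rows):
    """Resolve field_member_of identifier references to node ids as rows
    are processed in order. Each row's field_identifier is mapped to its
    (simulated) node id so later rows can look the parent up."""
    id_to_nid = {}
    for nid, row in enumerate(rows, start=1):  # stand-in for Drupal's node ids
        row["node_id"] = nid
        id_to_nid[row["field_identifier"]] = nid
        parent = row.get("field_member_of", "")
        if parent:
            row["field_member_of"] = id_to_nid[parent]
    return rows
```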
delimiter defaults to , if not specified in the config file.
Currently there is no way to remove all values from a field. This should probably be a specific case within the "update" task. Also, in the update task, allow users to replace existing values, not append to them (as happens now). Maybe the latter could be a two-step process where the user deletes all values and then updates with new ones?
Islandora Defaults uses the TypedRelations FieldType for field_linked_agent. This means that a valid value for a field_linked_agent field includes both the relator (e.g. 'relators:pht' for a Photographer) and the Photographer's term id.
Also, assuming we implement multi-valued columns (#19), we could have a cell listing the relator code/term id pairs delimited by semi-colons.
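Parsing such a cell might look like this (a sketch; the namespace:code:target_id shape is an assumption based on the TypedRelation field's structured value, and the output dict keys are illustrative, not Drupal's exact field item keys):

```python
def parse_typed_relations(cell, subdelimiter=";"):
    """Parse a semicolon-delimited list of typed relation values such as
    'relators:pht:27;relators:art:30' into relator/target pairs."""
    values = []
    for chunk in cell.split(subdelimiter):
        namespace, code, target_id = chunk.strip().split(":")
        values.append({"rel_type": namespace + ":" + code,
                       "target_id": int(target_id)})
    return values
```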
Strangely, http://admin:islandora@localhost:8000/jsonapi/field_storage_config/field_storage_config does not appear to indicate whether a field is required or not. This information would be invaluable since, if we have it, we can warn the user that they have omitted a required field. If a field is required and it is not present in the CSV file, Drupal does not create the node.
It is duplicated in the 'create' and 'add_media' tasks. Could exist once in workbench_utils.py.
This comes from #37.
In the update task, for fields of non-1 and non-unlimited cardinality, we slice the subvalues to match the cardinality. What we should be doing is making sure the existing values remain, and then slice the rest to get the new values to add.
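In other words: preserve the existing values, then fill only the remaining room with new ones. A sketch of the intended behavior (function name is hypothetical):

```python
def merge_update_values(existing, new, cardinality):
    """Keep existing field values and append only as many new values as
    the field's remaining cardinality allows. -1 means unlimited, so
    everything is appended."""
    if cardinality == -1:
        return existing + new
    room = cardinality - len(existing)
    return existing + new[:max(room, 0)]
```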
https://github.com/requests/requests-threads
As @dannylamb said in IRC, "managing the pool with 1 thread would be good to do, though. then you're not doing all the bookkeeping for managing the batch of requests"