denshoproject / ddr-cmdln
Command-line tools for automating the Densho Digital Repository's various processes.
License: Other
See #21. This is likely to be a big job.
ddr-cmdln
- ddr-cmdln will be running on the partners' local machines
- ddr-cmdln may store collection/entity repos in folders shared with the host OS (Windows 7)
- ddr-cmdln will need to copy collection/entity repos to external USB drives connected to the host machine, and do various sync operations
- git-annex makes heavy use of symlinks
Entities can only exist in the context of a collection. From the user's point of view, they will always create a collection first, then add entities to it.
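Since git-annex stores file content behind symlinks, any storage that cannot hold symlinks (VirtualBox vboxsf shared folders and FAT-formatted USB drives commonly cannot) will break an annex checkout. A minimal probe for this constraint, sketched here for illustration (the function is not part of ddr-cmdln):

```python
import os
import tempfile

def supports_symlinks(path):
    """Return True if the filesystem at `path` can hold symlinks.

    git-annex keeps annexed content behind symlinks, so repos placed on
    shared folders or FAT drives that reject symlinks will not work.
    """
    probe_dir = tempfile.mkdtemp(dir=path)
    target = os.path.join(probe_dir, 'target')
    link = os.path.join(probe_dir, 'link')
    try:
        open(target, 'w').close()
        os.symlink(target, link)
        return os.path.islink(link)
    except OSError:
        return False
    finally:
        for p in (link, target):
            try:
                os.remove(p)
            except OSError:
                pass
        os.rmdir(probe_dir)
```

Running this against a candidate drive before cloning a repo onto it would catch the problem early.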
Mon, Jul 10, 2017 at 2:47 PM
[The] editor seems to be munging the "topics" dict when it reloads the data. Here's an example of the diff from ddr-pc-33/files/ddr-pc-33-15/entity.json:
<<<<<<< HEAD
"id": "Community publications: Pacific Citizen:389",
"term": "Journalism and media"
=======
"id": "389",
"term": "Journalism and media: Community publications: Pacific Citizen"
>>>>>>> 26b61ec199b3e3e9ffa189caa18a2c795f8756e9
The version after ======= is the original data; the HEAD version is what the editor is doing to existing topic data.
Additional context, from the entity.json file this is taken from:
[
{
<<<<<<< HEAD
"app_commit": "00d6bf004a20c921f921fa5f28616ce642a51958 (HEAD, tag: v2.0, origin/master, origin/HEAD, master) 2017-05-03 11:27:32 -0700",
"app_release": "0.9.4-beta",
"application": "https://github.com/densho/ddr-cmdln.git",
"git_version": "git version 2.1.4; git-annex version: 5.20141125\nbuild flags: Assistant Webapp Webapp-secure Pairing Testsuite S3 WebDAV Inotify DBus DesktopNotify XMPP DNS Feeds Quvi TDFA CryptoHash\nkey/value backends: SHA256E SHA1E SHA512E SHA224E SHA384E SKEIN256E SKEIN512E SHA256 SHA1 SHA512 SHA224 SHA384 SKEIN256 SKEIN512 WORM URL\nremote types: git gcrypt S3 bup directory rsync web webdav tahoe glacier ddar hook external\nlocal repository version: 5\nsupported repository version: 5\nupgrade supported from repository versions: 0 1 2 4",
"models_commit": "8c5e0b200fe5f02c9216fd4bc3be42d46d881cf5 2017-02-01 14:36:59 -0800"
=======
"app_commit": "9d906ffdb5df85c59fd57034abcb424bb302202d (HEAD, origin/209-upgrade-elasticsearch, 209-upgrade-elasticsearch) 2017-01-30 17:45:05 -0800",
"app_release": "0.9.4-beta",
"application": "https://github.com/densho/ddr-cmdln.git",
"git_version": "git version 2.1.4; git-annex version: 5.20141125\nbuild flags: Assistant Webapp Webapp-secure Pairing Testsuite S3 WebDAV Inotify DBus DesktopNotify XMPP DNS Feeds Quvi TDFA CryptoHash\nkey/value backends: SHA256E SHA1E SHA512E SHA224E SHA384E SKEIN256E SKEIN512E SHA256 SHA1 SHA512 SHA224 SHA384 SKEIN256 SKEIN512 WORM URL\nremote types: git gcrypt S3 bup directory rsync web webdav tahoe glacier ddar hook external\nlocal repository version: unknown\nsupported repository version: 5\nupgrade supported from repository versions: 0 1 2 4",
"models_commit": "2106bb0a6c686e4258c0d9d02d1ced96c02f357f 2017-01-23 17:11:28 -0800"
>>>>>>> 26b61ec199b3e3e9ffa189caa18a2c795f8756e9
},
...
How can ddr-cmdln code access USB drives attached to the host OS? Assume we're using VirtualBox.
Detecting files with the same ID: a collision is only a problem if the files share both the same EID and the same role.
I think this is how it already works.
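A hypothetical sketch of that rule (parse_file_id and same_file_conflict are invented here for illustration, and assume file IDs shaped like ddr-ORG-CID-EID-ROLE-SHA1):

```python
ROLES = ('master', 'mezzanine')  # roles mentioned elsewhere in these issues

def parse_file_id(file_id):
    """Split a file ID like 'ddr-testing-345-1-master-77ec5e4008' into
    (entity_id, role, sha1). Assumes the role component is one of ROLES."""
    parts = file_id.split('-')
    role_idx = next(i for i, p in enumerate(parts) if p in ROLES)
    return '-'.join(parts[:role_idx]), parts[role_idx], parts[role_idx + 1]

def same_file_conflict(id_a, id_b):
    """Two file IDs are only a real collision if they share both the
    entity ID (EID) and the role."""
    entity_a, role_a, _ = parse_file_id(id_a)
    entity_b, role_b, _ = parse_file_id(id_b)
    return entity_a == entity_b and role_a == role_b
```

Under this rule, two masters under the same entity collide, while a master and a mezzanine with the same EID do not.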
ddr-filter on ddr-local/ddr-cmdln latest master. I've confirmed the same issue on two separate VMs (kinkura, dragontail) with two different repos (ddr-densho-68, ddr-densho-69). Both VMs were just updated with make install on latest master; both were rebooted after the install, and make status looked fine. For good measure, I ran make clean, then make install-app with a reboot on dragontail, with no effect on the error.
Traceback from the console:
(ddrlocal)ddr@kinkura:/media/qnfs/kinkura/working/201609$ ddr-filter -ma -d /media/qnfs/kinkura/working/201609 /media/qnfs/kinkura/gold/ddr-densho-68
Traceback (most recent call last):
File "/usr/local/src/env/ddrlocal/bin/ddr-filter", line 4, in <module>
__import__('pkg_resources').run_script('ddr-cmdln==0.9.4b0', 'ddr-filter')
File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 744, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1499, in run_script
exec(code, namespace, namespace)
File "/usr/local/src/env/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-filter", line 575, in <module>
main()
File "/usr/local/src/env/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-filter", line 483, in main
exclusion_list = make_exclusion_list(args.source)
File "/usr/local/src/env/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-filter", line 329, in make_exclusion_list
nonpublic_json = nonpublic_json_files(files_json)
File "/usr/local/src/env/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-filter", line 267, in nonpublic_json_files
if is_file_json(path):
File "/usr/local/src/env/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-filter", line 208, in is_file_json
return re.search(MODEL_JSON_REGEX['file-json'], path)
KeyError: 'file-json'
File ingest fails if user tries to add files twice.
Celery error message:
Could not upload ddr-densho-296-98-1_mezz.tif to ddr-densho-296-98.
Exception('Add file aborted, see log file for details: /var/log/ddr/addfile/...
Collection repo is left in an inconsistent state.
UPDATE: Modify file ingest to look for the file at the destination path once it has a sha1, and quit/die if the file already exists in the repo. This will prevent modifications.
Write a simple script that
ddr-import file fails when trying to import files when the targeted repo is on non-local storage (i.e., an NFS mount). The error is:
OSError: [Errno 18] Invalid cross-device link
Resolution: ingest.py should use shutil.move instead of os.rename. See:
*(We should probably find and replace all instances of os.rename in ddr-cmdln -- this has shown up in the past and was fixed in ddr-pubcopy)
EAD.xml and METS.xml are now written with Jinja2 instead of lxml, which is good, but we're not actually putting much data into the files, which is not so good.
We're now using backports to get git-annex version 5, but functions that consume git annex status no longer work. That command was changed to git annex info. The output is slightly different now, as is the output for git annex version.
TL;DR, there is an exception when trying to import only certain entities into ddr repos. The attached stacktrace shows the error I'm getting in all cases when the import fails.
I have test clones of both ddr-manz-1 and ddr-densho-1000 on the qumulo at: /media/qnfs/kinkura/temp/201706-ddrimport/test
Some entities will import successfully. I was able to import all of the interview entities in /media/qnfs/kinkura/temp/201706-ddrimport/ddr-interviews/ddr-manz-1-interviews.csv; but several of the interviews in /media/qnfs/kinkura/temp/201706-ddrimport/ddr-interviews/ddr-densho-1000-interviews.csv fail. Some do work, however; for example, ddr-densho-1000-439. See: /media/qnfs/kinkura/temp/201706-ddrimport/ddr-segments/ddr-densho-1000-interviews-test.csv
The same error occurred when trying to import the segments for ddr-manz-1-167 (/media/qnfs/kinkura/temp/201706-ddrimport/ddr-segments/ddr-manz-1-167-segments.csv).
A few things I've already tried with no difference in behavior:
- ddr-import on brand-new clones of ddr-manz-1 and ddr-densho-1000
- mixing entity data that worked with entity data that did not in the same import csv file, just to make sure there wasn't anything wrong with the particular file itself. See: /media/qnfs/kinkura/temp/201706-ddrimport/ddr-segments/ddr-manz-1-167-segments-test.csv
I've examined the rows in the csvs that work (i.e., ddr-densho-1000-439 vs. ddr-densho-1000-441 in ddr-densho-1000-interviews.csv) and couldn't find anything suspicious such as high-bit chars, extra columns, unterminated escapes, etc.
At this point, I'm not sure what else to try. Looks like it might have something to do with the identifier.py code that determines whether the entity already exists; but I'm not certain.
(ddrlocal) ddr@maunakea:/media/qnfs/kinkura/temp/201706-ddrimport/test$ ddr-import entity ../ddr-segments/ddr-manz-1-167-segments-test.csv ./ddr-manz-1
2017-06-23 11:03:43,647 DEBUG <DDR.identifier.Identifier collection:ddr-manz-1>
2017-06-23 11:03:43,648 DEBUG /media/qnfs/kinkura/temp/201706-ddrimport/test/ddr-manz-1
2017-06-23 11:03:43,648 INFO ------------------------------------------------------------------------
2017-06-23 11:03:43,648 INFO batch import entity
2017-06-23 11:03:43,651 INFO <git.Repo "/media/qnfs/kinkura/temp/201706-ddrimport/test/ddr-manz-1/.git">
2017-06-23 11:03:43,652 INFO Reading /media/qnfs/kinkura/temp/201706-ddrimport/ddr-segments/ddr-manz-1-167-segments-test.csv
2017-06-23 11:03:43,655 INFO 16 rows
2017-06-23 11:03:43,656 INFO - - - - - - - - - - - - - - - - - - - - - - - -
2017-06-23 11:03:43,656 INFO Importing
2017-06-23 11:03:43,656 INFO 1/16 - ddr-manz-1-158-1
2017-06-23 11:03:43,754 DEBUG | <DDR.identifier.Identifier segment:ddr-manz-1-158-1> (0:00:00.097908)
2017-06-23 11:03:43,755 INFO 2/16 - ddr-manz-1-167-12
Traceback (most recent call last):
File "/usr/local/src/ddr-local/venv/ddrlocal/bin/ddr-import", line 4, in <module>
__import__('pkg_resources').run_script('ddr-cmdln==0.9.4b0', 'ddr-import')
File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1500, in run_script
exec(code, namespace, namespace)
File "/usr/local/src/ddr-local/venv/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-import", line 228, in <module>
main()
File "/usr/local/src/ddr-local/venv/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-import", line 190, in main
args.dryrun
File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/batch.py", line 525, in import_entities
entity = eidentifier.object()
File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/identifier.py", line 1047, in object
return self.object_class(mappings).from_identifier(self)
File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 1351, in from_identifier
return from_json(Entity, identifier.path_abs('json'), identifier)
File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 366, in from_json
document.load_json(fileio.read_text(json_path))
File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/fileio.py", line 16, in read_text
raise IOError('File is missing or unreadable: %s' % path)
IOError: File is missing or unreadable: /media/qnfs/kinkura/temp/201706-ddrimport/test/ddr-manz-1/files/ddr-manz-1-167/files/ddr-manz-1-167-12/entity.json
files: make file-role a separate column from entity-id
e.g. "ddr-test-123-master" -> "ddr-test-123", "master"
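A sketch of the proposed split, assuming the combined value always ends with a known role (the helper is invented for illustration):

```python
KNOWN_ROLES = ('master', 'mezzanine')  # roles used elsewhere in these issues

def split_entity_role(value):
    """Split a combined CSV value like 'ddr-test-123-master' into
    separate entity-id and file-role columns."""
    entity_id, _, role = value.rpartition('-')
    if role not in KNOWN_ROLES:
        raise ValueError('unrecognized role in %r' % value)
    return entity_id, role
```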
We are seeing this after having run ddr sync, which executes DDR.commands.sync():
densho@kinkura:/media/qnfs/kinkura/gold/ddr-densho-299$ git status
# On branch master
# Your branch is ahead of 'origin/master' by 539 commits.
# (use "git push" to publish your local commits)
#
The repository on the Gitolite server seems to contain the same commits as reported by git log:
http://partner.densho.org/cgit/cgit.cgi/ddr-densho-299
Everything is fine after manually executing the sequence of commands in the function:
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git checkout master
Already on 'master'
Your branch is ahead of 'origin/master' by 3 commits.
(use "git push" to publish your local commits)
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git pull origin master
From mits.densho.org:ddr-testing-259
* branch master -> FETCH_HEAD
d5cf973..741f157 master -> origin/master
Already up-to-date.
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git checkout git-annex
Switched to branch 'git-annex'
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git pull origin git-annex
From mits.densho.org:ddr-testing-259
* branch git-annex -> FETCH_HEAD
Already up-to-date.
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git checkout git-annex
Already on 'git-annex'
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git push origin git-annex
Everything up-to-date
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git push origin master
Everything up-to-date
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git status
# On branch master
# Your branch is up-to-date with 'origin/master'.
#
nothing to commit, working directory clean
Where exactly are we going to put all the entity and collection repos on the local machine?
better ddr-import check reporting: in row N of the CSV, (this ID), this field was bad
don't die on the first problem: find all the bad rows and print them at once
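One way to sketch that accumulate-then-report style (the required columns and the helper are hypothetical, not ddr-import's actual schema):

```python
import csv
import io

REQUIRED = ['id', 'title']  # hypothetical required columns

def validate_csv(text):
    """Check every row and return all problems at once,
    instead of dying on the first bad row."""
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    for n, row in enumerate(reader, start=2):  # row 1 is the header
        rowid = row.get('id') or '(no id)'
        for field in REQUIRED:
            if not (row.get(field) or '').strip():
                errors.append('row %d (%s): field %r is empty' % (n, rowid, field))
    return errors
```

Printing the whole list at the end lets an operator fix every bad row in one pass over the CSV.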
The [Record Created] field for ddr-testing-40003 says it was created Thursday 13 July 2017, 03:15 PM PDT - which I believe to be correct. However the [Record Created] field for object ddr-testing-40003-1 says it was created Thursday 13 July 2017, 01:48 PM PDT - which is not possible - I was either at lunch or had just gotten back and was not working in the testing repos anyway. Not sure how problematic this potentially could be, just something of note.
Before we launch into conspiracy theories, check the datetime on that VM, it's probably off. Whenever you close a VM, saving the state, and then wake it up again you'll get clock drift. In my dev environment I always restart NTP after waking up a VM ("sudo service ntp restart") or else all my Git commit timestamps are off. In the Densho HQ it's probably best to shutdown VMs rather than saving state. Also, the host machine clock could be off.
Update on the time error:
Philip and I noticed that in multiple collections, the [Record Creation] time-stamp in editor is subtracting a fixed amount of time from the initial time that the entire collection was created and then applying that resulting time signature globally across all of the collections' objects.
A formulaic depiction of the error is as follows:
(Valid Collection Creation Time) - n = Globally applied object [Record Creation] value.
-Philip and I have looked at collections within ddr-testing and ddr-densho and have seen this popping up in numerous collections' objects, although the value of n differs across certain date-ranges of the collections' creation.
-Time drift has ranged from a few minutes to multiple days
*also, for time error, I should note that we checked the VM clocks and they were valid.
Maybe you're looking at collections that were imported into the DDR system from CSV files (I don't have a list but Senor Froh probly does). You'd see a collection created at one time and then large numbers of entities or files all created at the same time.
Were ddr-testing-40003 and ddr-testing-40003-1 created on/from the same machine? Your setup there is for everything to be stored on the Qumulo but the JSON files (and the timestamps) come from whatever VM is writing them.
One possibility is that the background process (celery) that actually writes and commits most files might be lagging, but I wouldn't expect it to lag by more than a minute or two.
Also, there's a changelog for every entity in the entity's folder (the file is called "changelog"). Every modification that's done through the app should be recorded there*.
Maybe you're looking at collections that were imported into the DDR system from CSV files (I don't have a list but Senor Froh probly does). You'd see a collection created at one time and then large numbers of entities or files all created at the same time.
this might possibly exist in older collections, but I am seeing this in collections which I know were not imported from a CSV ddr-densho-332, which was made in May of 2017 and ddr-testing-40003 which was made yesterday.
ddr-densho-330 (which was made in Feb and was worked on from March through April) also has timestamp differences, though the differences are not the same for every entity (ranges from two minutes to 50+ hours).
There are definitely a large number of collections -- especially the earlier ddr-densho repos, most of the ddr-njpa, and a bunch of others -- that were originally created from import. The collection repo itself would have been created first -- i.e., collection.json -- then the entities would have be imported from csv.
Can you give me a sample list of a couple of the collections where you see this behavior?
And just to clarify, this isn't happening when you create new objects through the webui in a new collection, right?
Were ddr-testing-40003 and ddr-testing-40003-1 created on/from the same machine? Your setup there is for everything to be stored on the Qumulo but the JSON files (and the timestamps) come from whatever VM is writing them.
Yes, they were made on the same machine and at this time all the machines are running the same VM appliance.
One possibility is that the background process (celery) that actually writes and commits most files might be lagging, but I wouldn't expect it to lag by more than a minute or two.
Also, there's a changelog for every entity in the entity's folder (the file is called "changelog"). Every modification that's done through the app should be recorded there*.
looking at the entity changelog for ddr-testing-40003-1 it has the same initialization time that the collection changelog has: Thu, 13 July 2017, 03:18 PM PDT. the [Record Created] field in Editor says Thu, 13 July 2017, 01:48 PM PDT
- Manual modifications (i.e. using a text editor) are usually not recorded unless somebody also changes the changelog. Batch operations may also neglect to update the changelog.
Also, looking at older collection/entity changelogs, the initialization/update time/date format is different.
on newer collections it is (example from ddr-testing-40003-1 entity initialization): Thu, 13 July 2017, 03:18 PM PDT
on older collections it is (example from ddr-densho-303 collection initialization): Mon, 22 Jun 2017 11:40:50
(note that the older format is on a 24-hour clock)
Can you give me a sample list of a couple of the collections where you see this behavior?
as in collections imported from a csv exhibiting differences in timestamps? or just collections with differences in timestamps? if you mean the latter: ddr-densho-303, ddr-densho-330, ddr-densho-332, ddr-densho-325, and all of the ddr-testing collections so far.
And just to clarify, this isn't happening when you create new objects through the webui in a new collection, right?
this IS happening with new objects in new collections through the webui
handle FIELDS['csv'][..] same as 'elasticsearch', 'form'
Add and support a File.external_urls field. We need to be able to import this from CSV, edit in ddr-local, and display in ddr-public.
pkikawa 2017-08-28 11:17
Error ID:[EID_20170828_1]
Status: [ToDo]
Local Machine: [see comment]
Full Collection+Object ID Error Path: []
Time and Date of Error: [recurring error since at least Aug 1st, most recently Aug 28 at 11:00]
Brief Error Description: [deleting binary files from the webui results in a failure notification, but the files get deleted in the actual repo. Running git status reveals the files have been deleted, but changes to the object's entity.json are not staged for commit.]
Steps to Replicate Error: [delete any master/mezz from the web ui (tested and repeatable in testing collections); error messages are the same]
ddr sync is pushing local changes to synced/master on mits; but not to master as expected.
Test routine was as follows:
ddr@kinkura:/ddr-testing-247$ nano collection.json
ddr@kinkura:/ddr-testing-247$ git status
# On branch master
# Changes not staged for commit:
# (use "git add <file>..." to update what will be committed)
# (use "git checkout -- <file>..." to discard changes in working directory)
#
# modified: collection.json
#
no changes added to commit (use "git add" and/or "git commit -a")
ddr@kinkura:/ddr-testing-247$ git add collection.json
ddr@kinkura:/ddr-testing-247$ git commit -m"Manual commit of test changes for ddr sync testing."
[master 9e16e03] Manual commit of test changes for ddr sync testing.
1 file changed, 1 insertion(+), 1 deletion(-)
ddr@kinkura:/ddr-testing-247$ git status
# On branch master
# Your branch is ahead of 'origin/master' by 1 commit.
#
nothing to commit (working directory clean)
ddr@kinkura:/usr/local/src/ddr-local/ddrlocal$ ddr sync -u DDRAdmin -m [email protected] -c /ddr-testing-247
ddr@kinkura:/usr/local/src/ddr-local/ddrlocal$ cd /ddr-testing-247
ddr@kinkura:/ddr-testing-247$ git status
# On branch master
# Your branch is ahead of 'origin/master' by 1 commit.
#
nothing to commit (working directory clean)
ddr@kinkura:/ddr-testing-247$ git push
Total 0 (delta 0), reused 0 (delta 0)
To [email protected]:ddr-testing-247.git
6b915bd..9e16e03 master -> master
ddr@kinkura:/ddr-testing-247$ git status
# On branch master
nothing to commit (working directory clean)
Notes:
Currently, ddrindex generates signature images for Entities by choosing the access binary for the first mezzanine File. The Collection signature image is the access binary for the first mezzanine of the first Entity in the Collection. However, Files are ordered under their parent Entity using the 'sort' attribute in the File json.
ddrindex should use the order of the Files as determined by their 'sort' values to select the signature image, rather than simply selecting the first mezzanine file image in the filesystem sort order.
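A sketch of the proposed selection logic, assuming File metadata dicts with 'role', 'sort', and an access-image field (the field names here are assumptions, not ddrindex's actual schema):

```python
def choose_signature(files):
    """Pick an Entity's signature image from its File metadata.

    Selects the access image of the mezzanine File with the lowest
    'sort' value, instead of whichever mezzanine happens to come
    first in filesystem order.
    """
    mezzanines = [f for f in files if f.get('role') == 'mezzanine']
    if not mezzanines:
        return None
    # Files missing 'sort' fall to the end rather than winning by accident
    return min(mezzanines, key=lambda f: f.get('sort', float('inf')))['access']
```

The Collection signature would then be chosen the same way from the first Entity's Files.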
ddr-filter command throws an exception -- "Could not import identifier definitions."
- ddr on latest master for ddr-local and ddr-cmdln
- ddr-defs is up-to-date and present; /etc/ddr/ configs are valid
(ddrlocal)ddr@kinkura:/media/qnfs/kinkura/gold/ddr-densho-284$ ddr-filter -ma -s /media/qnfs/kinkura/gold/ddr-densho-284 -d /media/qnfs/kinkura/working
Traceback (most recent call last):
File "/usr/local/src/env/ddrlocal/bin/ddr-filter", line 4, in <module>
__import__('pkg_resources').run_script('ddr-cmdln==0.9.4b0', 'ddr-filter')
File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 719, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1504, in run_script
exec(code, namespace, namespace)
File "/usr/local/src/env/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-filter", line 72, in <module>
from DDR import identifier
File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/identifier.py", line 15, in <module>
raise Exception('Could not import Identifier definitions!')
Exception: Could not import Identifier definitions!
GFroh 2017-07-10 14:47
[It looks] like the topic term for Hawai'i is not being properly inserted into the data itself. It is being rendered as:
{ "id": "277", "term": "i" },
The other behavior I've just noticed is that the editor seems to be munging the "topics" dict when it reloads the data. Here's an example of the diff from ddr-pc-33/files/ddr-pc-33-15/entity.json:
<<<<<<< HEAD
"id": "Community publications: Pacific Citizen:389",
"term": "Journalism and media"
=======
"id": "389",
"term": "Journalism and media: Community publications: Pacific Citizen"
>>>>>>> 26b61ec199b3e3e9ffa189caa18a2c795f8756e9
The version after ======= is the original data; the HEAD version is what the editor is doing to existing topic data.
After running the ddr sync command (directly from the commandline or invoked by the ddr-local webui), local collection repos still report being ahead of the origin remote on mits. However, the remote on mits does appear to be receiving local changes.
- ddr sync runs without errors
- git status shows the branch as ahead
- origin/master on mits has the same commit hash for HEAD pointed at master; file content matches
- manual command sequence: git checkout master; git pull origin; git checkout git-annex; git pull origin; git push origin; git checkout master; git push origin
For more debug detail, see: https://docs.google.com/document/d/1cPDNWvSDXhmM4kK2ujU6nOknViKtvFKnPUsBacANT_M/
It would be great to be able to display progress bars for long-running processes, especially batch processes.
To clarify, we want to be able to use one process (e.g. the web UI) to check the progress of another process (celery, ddr-cmdln).
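One minimal way to sketch cross-process progress reporting (a Celery task could instead publish progress via update_state(); the file-based mechanism and names below are invented for illustration):

```python
import json
import os
import tempfile

# Illustrative status-file location shared between worker and UI processes.
STATUS_PATH = os.path.join(tempfile.gettempdir(), 'ddr-task-status.json')

def report_progress(task_id, current, total):
    """Worker side: publish progress where another process can read it."""
    tmp = STATUS_PATH + '.tmp'
    with open(tmp, 'w') as f:
        json.dump({'task': task_id, 'current': current, 'total': total}, f)
    os.replace(tmp, STATUS_PATH)  # atomic swap: readers never see a partial write

def read_progress():
    """UI side: poll the status file and compute percent complete."""
    with open(STATUS_PATH) as f:
        status = json.load(f)
    return status['current'] * 100 // status['total']

# Example: a batch importer reporting after its 4th of 16 rows
report_progress('batch-import', 4, 16)
```

The write-to-temp-then-rename pattern matters here: the UI process may poll at any moment, so the status file must always be a complete JSON document.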
< geoff.froh at densho.org > 2015-05-19 10:50:
Here's some additional followup on the odd git annex status behavior we were discussing yesterday. To summarize, in the group of ddr-njpa repos that I'm prepping for publication (ddr-njpa-1, ddr-njpa-2, ddr-njpa-4, ddr-njpa-6, ddr-njpa-8), when running git annex status there is a discrepancy between "local annex keys"/"local annex size" vs. "known annex keys"/"known annex size". The "local" values are always lower, at approximately 50% of the "known" values. However, git annex whereis reports that all of the content that should exist is present in the local annex ("here"); in addition, git annex fsck is clean and git annex unused does not contain a large amount of data.
Sample output:
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-njpa-1$ git annex status
...
local annex keys: 8764
local annex size: 17 gigabytes
known annex keys: 17530
known annex size: 35 gigabytes
...
git annex fsck: clean
git annex whereis: all OK; 1 copy here (qnfs)
Yesterday afternoon, I deleted the existing ddr-njpa-8 repo, and did a fresh recreate and reimport. For the recreate, I used the ddr-cmdln 'ddr create' command-line tool; the reimport used the ddr-local migration library on the current master branch. Both operations were using a fresh VM (deb 7.8) on the laptop dragontail connected to the qumulo.
I ran ddrfilter and ddrpubcopy on the newly created repo and both worked as expected, producing the expected output data. However, git annex status shows exactly the same discrepancy as the other existing ones (i.e., "local" is almost exactly half of "known").
At this point, I'm going to do a larger sample of the other repos in gold to see if this behavior is more widespread.
Again, not sure if this really matters; although I've used git annex status as one way -- along with checking the output from the git annex get op, of course -- to verify that I've got all of a partner's data copied both at the remote site when prepping the drive and at HQ when pulling data into the gold repos on the qumulo.
The 'role' attribute (whether a file is master, mezzanine or some other yet-to-be determined value) should be explicitly recorded in an attribute in file.json rather than inferred from the name of the binary file itself.
This is a change that will require a mass update of all file jsons....
When exporting entities and there is more than one value present in the 'topics' attribute, ddr-export inserts a spurious \n char for each subsequent KV pair in the list. Example:
"term:Military service: Veterans' organizations|id:19;\nterm:World War II: Military service|id:88;\nterm:World War II: Military service: 100th Infantry Battalion|id:421",
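The expected serialization, with KV pairs joined by ';' alone and no newline characters, could be sketched like this (the helper is invented, not ddr-export's actual code):

```python
def format_topics(topics):
    """Serialize a list of topic dicts for export, joining KV pairs
    with ';' only, never inserting a newline between entries."""
    return ';'.join('term:%s|id:%s' % (t['term'], t['id']) for t in topics)
```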
Entity/segment .children lists are not populated when batch-importing from CSV.
Update: We need to update the Entity object so that it causes its parent Entity (if any) to update its children/files.
Requirements:
Problems:
id_rsa.pub
git commit every time the user edits a mets.xml or ead.xml. It might become necessary to separate more significant events from the noise of text edits.
Possible solution:
When updating existing Entities, ddr-import entity is setting important fields to empty values:
- record_created -> None
- files -> []
This is a destructive change! We cannot have this!
< geoff.froh at densho.org > 09:39
Having issues running ddrfilter against a number of the ddr-njpa repos in the batch for publication. TL;DR the command does not seem to be moving all the file content when cloning the filter repo.
The two repos I've had this problem with are ddr-njpa-1 and ddr-njpa-4. I've run ddrfilter from two separate VMs on two different physical hosts (the high-power laptop where I've run these ops before, and kinkura on the VM host).
ddr-njpa-1 has 17530 files total in the repo. Requesting only access files with ddrfilter -a produces 4777 files. I dug into the ddrfilter code and found where the file list for git annex get is created (https://github.com/densho/ddr-cmdln/blob/abbb4b337ee59d15bb964fc35d6cc1856431297b/ddr/bin/ddrfilter#L128). Running git annex manually against the target repo using the same file pattern ('git annex whereis -a.jpg') produces a list with 8765 access files. The filter log, as well as the output from annex whereis for both the original repo and the filter repo are attached.
I've gone into both the original and filter repos to compare the metadata in file/entity jsons for files that were transferred vs. those that weren't. So far, I haven't found anything different -- i.e., status, privacy, etc.
FWIW, the command did work as expected on ddr-njpa-8. That repo is substantially smaller in both number of files and size than either ddr-njpa-1 or ddr-njpa-4. These may be the largest repos, in terms of number of files, that we have run through ddrfilter thus far -- maybe that's a factor? Not sure. (On a side note, I don't think this is related to the duplicate binary thing we discussed in relation to the git annex status output -- ddr-njpa-8 also has the same original file used in both the master and mezzanine roles.)
For now, would you take a look at the logs and reexamine the code for anything you think might be causing this? I'm going to try manually adding the original repo annex as a remote to the filter repo and git annex getting the missing content.
digitize_date in the Entity model is stored as a str type (as it should be); however, the form_type is still DateField, causing parsing errors.
See: https://github.com/densho/ddr-cmdln/blob/master/ddr/DDR/models/entity.json#L546
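A hedged sketch of one way to handle the mismatch: keep digitize_date as a string, normalizing only values that actually parse as dates and passing free-text values through instead of raising (the function name and accepted formats are assumptions, not the actual fix):

```python
from datetime import datetime

def clean_digitize_date(value):
    """Keep digitize_date as a str, but normalize values that do parse
    as dates; pass anything else through untouched instead of raising
    the way a DateField form_type would."""
    for fmt in ('%Y-%m-%d', '%m/%d/%Y'):
        try:
            return datetime.strptime(value.strip(), fmt).strftime('%Y-%m-%d')
        except ValueError:
            pass
    return value  # free text like 'ca. 1942' survives as-is
```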
on 038-pubfilter branch: ddrfilter does not remove repo annex files because of a permissioning issue. The script needs to chmod -R +w /FILTER/REPO/DIR
Per 2015-05-12 emails re: Nippu Jiji collection, certain collections are cluttered with addfile.log files.
Support the proposed new format for Entity metadata, which includes children and file_groups.
We currently have a very complex system of object identifiers: object IDs, filesystem paths, and URLs.
These identifiers are used to do things such as:
Often these identifiers are parsed into parts: model, repo, org, cid, eid, role, sha1.
This will be a problem when we need to add layers.
Identifier object.
ddr-config
repo.densho.import_entities() seems to be ignoring the digitize_date column in the import CSV, inserting the current timestamp instead.
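One way this class of bug commonly arises: if the importer looks the column up with a fallback to "now", any header mismatch (extra whitespace, different case) silently replaces every CSV value with the current timestamp. A hypothetical illustration, not the actual import_entities() code:

```python
import csv
import io
from datetime import datetime

def load_rows(csv_text):
    """Illustrative buggy pattern: a header mismatch makes row.get()
    return None, so the now() fallback fires for every row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        digitize_date = row.get('digitize_date') or datetime.now().isoformat()
        rows.append({'id': row['id'], 'digitize_date': digitize_date})
    return rows
```

Worth checking whether the CSV's digitize_date header matches the field name exactly, and whether the importer has a similar fallback.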
Adding a new file works:
>>> from django.conf import settings
>>> from webui import identifier
>>> git_name = 'gjost'
>>> git_mail = '[email protected]'
>>> entity = identifier.Identifier('ddr-testing-345-1').object()
>>> src_path = '/tmp/bernie-sanders-socialism-scare.gif'
>>> role = 'mezzanine'
>>> data = {
... 'label': u'SOCIALISM!',
... 'path': u'/tmp/bernie-sanders-socialism-scare.gif',
... 'public': u'1',
... 'rights': u'cc',
... 'sort': 1,
... }
>>> agent = 'cmdln'
>>> file_,repo,log = entity.add_file(src_path, role, data, git_name, git_mail, agent)
>>> file_,repo,log = entity.add_file_commit(file_, repo, log, git_name, git_mail, agent)
>>> result = file_.post_json(settings.DOCSTORE_HOSTS, settings.DOCSTORE_INDEX)
No handlers could be found for logger "elasticsearch.trace"
But adding a new access file does not:
>>> from django.conf import settings
>>> from webui import identifier
>>> git_name = 'gjost'
>>> git_mail = '[email protected]'
>>> entity = identifier.Identifier('ddr-testing-345-1').object()
>>> ddrfile = identifier.Identifier('ddr-testing-345-1-master-77ec5e4008').object()
>>> src_path = u'files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008.gif'
>>> agent = 'cmdln'
>>> file_,repo,log = entity.add_access(ddrfile, git_name, git_mail, agent)
Traceback (most recent call last):
File "<input>", line 1, in <module>
file_,repo,log = entity.add_access(ddrfile, git_name, git_mail, agent)
File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 1252, in add_access
return ingest.add_access(self, ddrfile, git_name, git_mail, agent='')
File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/ingest.py", line 542, in add_access
repo = stage_files(entity, git_files, annex_files, new_files, log, show_staged=show_staged)
File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/ingest.py", line 283, in stage_files
log.crash('Add file aborted, see log file for details: %s' % log.logpath)
File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/ingest.py", line 46, in crash
raise Exception(msg)
Exception: Add file aborted, see log file for details: /var/log/ddr/addfile/ddr-testing-345/ddr-testing-345-1.log
The operation fails, leaving FILE.json modified and staged but not committed.
Here's the addfile.log:
[2016-06-17T11:36:20.547080] ok - ------------------------------------------------------------------------
[2016-06-17T11:36:20.547210] ok - DDR.models.Entity.add_access: START
[2016-06-17T11:36:20.547271] ok - entity: ddr-testing-345-1
[2016-06-17T11:36:20.547461] ok - ddrfile: <webui.models.DDRFile 'ddr-testing-345-1-master-77ec5e4008'>
[2016-06-17T11:36:20.547513] ok - Checking files/dirs
[2016-06-17T11:36:20.547567] ok - check dir /var/www/media/ddr/ddr-testing-345/files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008.gif (| src_path)
[2016-06-17T11:36:20.547669] ok - Identifier
[2016-06-17T11:36:20.547715] ok - | file_id ddr-testing-345-1-master-77ec5e4008
[2016-06-17T11:36:20.547756] ok - | basepath /var/www/media/ddr
[2016-06-17T11:36:20.548498] ok - | identifier <DDR.identifier.Identifier file:ddr-testing-345-1-master-77ec5e4008>
[2016-06-17T11:36:20.549069] ok - Checking files/dirs
[2016-06-17T11:36:20.549172] ok - check dir /var/www/media/ddr/tmp/file-add/ddr-testing-345/ddr-testing-345-1 (| tmp_dir)
[2016-06-17T11:36:20.549260] ok - check dir /var/www/media/ddr/ddr-testing-345/files/ddr-testing-345-1/files (| dest_dir)
[2016-06-17T11:36:20.549366] ok - Making access file
[2016-06-17T11:36:20.549451] ok - | /var/www/media/ddr/tmp/file-add/ddr-testing-345/ddr-testing-345-1/ddr-testing-345-1-master-77ec5e4008-a.jpg
[2016-06-17T11:36:20.590559] not ok - Traceback (most recent call last):
File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/ingest.py", line 155, in make_access_file
geometry=config.ACCESS_FILE_GEOMETRY
File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/imaging.py", line 117, in thumbnail
out = subprocess.check_output(cmd, shell=True)
File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command 'convert /var/www/media/ddr/ddr-testing-345/files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008.gif[0] -resize '1024x1024>' /var/www/media/ddr/tmp/file-add/ddr-testing-345/ddr-testing-345-1/ddr-testing-345-1-master-77ec5e4008-a.jpg' returned non-zero exit status 1
[2016-06-17T11:36:20.590728] ok - File object
[2016-06-17T11:36:20.590851] ok - | file_ <webui.models.DDRFile 'ddr-testing-345-1-master-77ec5e4008'>
[2016-06-17T11:36:20.590943] ok - Writing object metadata
[2016-06-17T11:36:20.591039] ok - | /var/www/media/ddr/tmp/file-add/ddr-testing-345/ddr-testing-345-1/ddr-testing-345-1-master-77ec5e4008.json
[2016-06-17T11:36:20.591940] ok - Moving files to dest_dir
[2016-06-17T11:36:20.591998] ok - | all files moved
[2016-06-17T11:36:20.592037] ok - Moving file .json to dest_dir
[2016-06-17T11:36:20.592125] ok - | mv /var/www/media/ddr/tmp/file-add/ddr-testing-345/ddr-testing-345-1/ddr-testing-345-1-master-77ec5e4008.json /var/www/media/ddr/ddr-testing-345/files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008.json
[2016-06-17T11:36:20.592397] ok - | all files moved
[2016-06-17T11:36:20.592529] ok - Staging files
[2016-06-17T11:36:20.593645] ok - | repo <git.Repo "/var/www/media/ddr/ddr-testing-345/.git">
[2016-06-17T11:36:20.603926] ok - | 2 files to stage:
[2016-06-17T11:36:20.603996] ok - | files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008.json
[2016-06-17T11:36:20.604033] ok - | files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008-a.jpg
[2016-06-17T11:36:20.604087] ok - git stage
[2016-06-17T11:36:20.610856] ok - annex stage
[2016-06-17T11:36:20.630233] ok - ok
[2016-06-17T11:36:20.636601] ok - | 1 files staged:
[2016-06-17T11:36:20.636658] ok - show_staged True
[2016-06-17T11:36:20.636684] ok - | files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008.json
[2016-06-17T11:36:20.636721] not ok - 1 new files staged (should be 2)
[2016-06-17T11:36:20.636743] not ok - File staging aborted. Cleaning up
[2016-06-17T11:36:20.636774] not ok - finished cleanup. good luck...
[2016-06-17T11:36:20.636812] not ok - Add file aborted, see log file for details: /var/log/ddr/addfile/ddr-testing-345/ddr-testing-345-1.log
ddr-local: 5bf6e4e794
ddr-cmdln: cce0071757
Single command that initializes the facets and mappings.
When setting a lock, write a meaningful explanation to the lockfile:
"ddr sync"
Display the lockfile's contents when an operation can't continue because of the lock:
Cannot continue: collection is locked ("ddr sync")
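The two behaviors above could be sketched roughly as follows (function and exception names are mine, not the ddr-cmdln API):

```python
import os

class CollectionLockedError(Exception):
    pass

def lock(lockfile, explanation):
    """Acquire the lock, writing a meaningful explanation (e.g. 'ddr sync').

    If already locked, surface the stored explanation in the error.
    """
    if os.path.exists(lockfile):
        with open(lockfile) as f:
            raise CollectionLockedError(
                'Cannot continue: collection is locked ("%s")' % f.read().strip())
    with open(lockfile, 'w') as f:
        f.write(explanation)

def unlock(lockfile):
    if os.path.exists(lockfile):
        os.remove(lockfile)
```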
Long-running Git sync operations may be interrupted by users who don't realize they can't just shut off the machine or close the VM. Rather than hoping this doesn't happen, or trying to prevent users from doing something careless, assume that it will happen at some point and figure out how to deal with it when it does.
Ideally, operations should be transactional.
If an operation fails, we don't want to be left with a repository in a half-modified, inconsistent state.
We DO want to log everything so we can figure out what happened if there's a problem. It would probably also be good to be able to turn cleanup off.
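One transactional shape that covers all three requirements (rollback of completed steps, logging of every step, and a switch to disable cleanup) is a simple undo stack; a sketch, not the actual ddr-cmdln mechanism:

```python
import logging

class Transaction:
    """Run a multi-step operation, rolling back completed steps on failure.

    Each step registers an undo callable; if a later step raises, the undos
    run in reverse order. Pass cleanup=False to leave the partial state in
    place for debugging. Every step and rollback is logged.
    """
    def __init__(self, cleanup=True):
        self.undos = []
        self.cleanup = cleanup

    def run(self, steps):
        try:
            for name, do, undo in steps:
                logging.info('step: %s', name)
                do()
                self.undos.append((name, undo))
        except Exception:
            logging.exception('operation failed')
            if self.cleanup:
                for name, undo in reversed(self.undos):
                    logging.info('rollback: %s', name)
                    undo()
            raise
```

An interrupted sync would then either complete fully or (with cleanup on) unwind back to the starting state, with the log recording exactly how far it got.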