

ddr-cmdln's Issues

Research ext2/3 drivers for Windows 7

  • ddr-cmdln will be running on the partners' local machines
  • ddr-cmdln may store collection/entity repos in folders shared with the host OS (Windows 7).
  • At some points ddr-cmdln will need to copy collection/entity repos to external USB drives connected to the host machine, and do various sync operations.
  • git-annex makes heavy use of symlinks.
  • FAT32 does not support symlinks, and NTFS symlinks are not usable by git-annex in this setup.

More incorrect parsing of bracketID fields

Mon, Jul 10, 2017 at 2:47 PM
[The] editor seems to be munging the "topics" dict when it reloads the data. Here's an example of the diff from ddr-pc-33/files/ddr-pc-33-15/entity.json

<<<<<<< HEAD
                "id": "Community publications: Pacific Citizen:389",
                "term": "Journalism and media"
=======
                "id": "389",
                "term": "Journalism and media: Community publications: Pacific Citizen"
>>>>>>> 26b61ec199b3e3e9ffa189caa18a2c795f8756e9

The version below the ======= marker (b) is the original data; the HEAD version (a) is what the editor is doing to existing topic data.
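
The pattern suggests the editor is splitting the stored topic string at the wrong colon. A minimal parsing sketch, assuming (from the diff) that the canonical form is the full term path with the numeric ID in trailing brackets:

import re

def parse_bracket_id(raw):
    """Split 'Journalism and media: Community publications: Pacific Citizen [389]'
    into term and ID. The term itself may contain colons, so only the trailing
    bracketed ID is split off; never split on ':'.
    """
    m = re.match(r'^(?P<term>.+?)\s*\[(?P<id>\d+)\]$', raw)
    if m:
        return {'id': m.group('id'), 'term': m.group('term')}
    return {'id': '', 'term': raw.strip()}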

Additional context, from the entity.json file this is taken from:

[
    {
<<<<<<< HEAD
        "app_commit": "00d6bf004a20c921f921fa5f28616ce642a51958  (HEAD, tag: v2.0, origin/master, origin/HEAD, master) 2017-05-03 11:27:32 -0700",
        "app_release": "0.9.4-beta",
        "application": "https://github.com/densho/ddr-cmdln.git",
        "git_version": "git version 2.1.4; git-annex version: 5.20141125\nbuild flags: Assistant Webapp Webapp-secure Pairing Testsuite S3 WebDAV Inotify DBus DesktopNotify XMPP DNS Feeds Quvi TDFA CryptoHash\nkey/value backends: SHA256E SHA1E SHA512E SHA224E SHA384E SKEIN256E SKEIN512E SHA256 SHA1 SHA512 SHA224 SHA384 SKEIN256 SKEIN512 WORM URL\nremote types: git gcrypt S3 bup directory rsync web webdav tahoe glacier ddar hook external\nlocal repository version: 5\nsupported repository version: 5\nupgrade supported from repository versions: 0 1 2 4",
        "models_commit": "8c5e0b200fe5f02c9216fd4bc3be42d46d881cf5  2017-02-01 14:36:59 -0800"
=======
        "app_commit": "9d906ffdb5df85c59fd57034abcb424bb302202d  (HEAD, origin/209-upgrade-elasticsearch, 209-upgrade-elasticsearch) 2017-01-30 17:45:05 -0800",
        "app_release": "0.9.4-beta",
        "application": "https://github.com/densho/ddr-cmdln.git",
        "git_version": "git version 2.1.4; git-annex version: 5.20141125\nbuild flags: Assistant Webapp Webapp-secure Pairing Testsuite S3 WebDAV Inotify DBus DesktopNotify XMPP DNS Feeds Quvi TDFA CryptoHash\nkey/value backends: SHA256E SHA1E SHA512E SHA224E SHA384E SKEIN256E SKEIN512E SHA256 SHA1 SHA512 SHA224 SHA384 SKEIN256 SKEIN512 WORM URL\nremote types: git gcrypt S3 bup directory rsync web webdav tahoe glacier ddar hook external\nlocal repository version: unknown\nsupported repository version: 5\nupgrade supported from repository versions: 0 1 2 4",
        "models_commit": "2106bb0a6c686e4258c0d9d02d1ced96c02f357f  2017-01-23 17:11:28 -0800"
>>>>>>> 26b61ec199b3e3e9ffa189caa18a2c795f8756e9
    },
...

ddr-filter fails with KeyError

ddr-filter on ddr-local/ddr-cmdln latest master. I've confirmed the same issue on two separate VMs (kinkura, dragontail) with two different repos (ddr-densho-68, ddr-densho-69). Both VMs were just updated with make install on latest master; both were rebooted after the install and make status looked fine. For good measure, I ran make clean, then make install-app with a reboot on dragontail with no effect on the error.

Traceback from the console:

(ddrlocal)ddr@kinkura:/media/qnfs/kinkura/working/201609$ ddr-filter -ma -d /media/qnfs/kinkura/working/201609 /media/qnfs/kinkura/gold/ddr-densho-68
Traceback (most recent call last):
  File "/usr/local/src/env/ddrlocal/bin/ddr-filter", line 4, in <module>
    __import__('pkg_resources').run_script('ddr-cmdln==0.9.4b0', 'ddr-filter')
  File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 744, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1499, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/src/env/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-filter", line 575, in <module>
    main()
  File "/usr/local/src/env/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-filter", line 483, in main
    exclusion_list = make_exclusion_list(args.source)
  File "/usr/local/src/env/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-filter", line 329, in make_exclusion_list
    nonpublic_json = nonpublic_json_files(files_json)
  File "/usr/local/src/env/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-filter", line 267, in nonpublic_json_files
    if is_file_json(path):
  File "/usr/local/src/env/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-filter", line 208, in is_file_json
    return re.search(MODEL_JSON_REGEX['file-json'], path)
KeyError: 'file-json'
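
The crash is a bare dict lookup on MODEL_JSON_REGEX. A hedged sketch of a guard that at least fails with a useful message (the placeholder dict stands in for the real mapping, which is presumably built from the identifier definitions):

import re

# Placeholder: in ddr-filter the real dict is built elsewhere.
MODEL_JSON_REGEX = {}

def is_file_json(path):
    # Guard the lookup so a missing pattern explains itself instead of
    # surfacing as a bare KeyError deep inside the script.
    pattern = MODEL_JSON_REGEX.get('file-json')
    if pattern is None:
        raise Exception(
            "No 'file-json' pattern in MODEL_JSON_REGEX -- "
            "check that the identifier definitions loaded correctly.")
    return re.search(pattern, path)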

file ingest crashes if same file added twice

File ingest fails if the user tries to add the same file twice.
Celery error message:

Could not upload ddr-densho-296-98-1_mezz.tif to ddr-densho-296-98.
Exception('Add file aborted, see log file for details: /var/log/ddr/addfile/...

Collection repo is left in an inconsistent state.

UPDATE: Modify file ingest to look for the file at the destination path once it has a SHA1, and quit/die if the file already exists in the repo. This will prevent the repo from being modified and left in an inconsistent state.
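
A minimal sketch of that check, assuming (from file IDs seen elsewhere in these issues, e.g. ddr-testing-345-1-master-77ec5e4008) that ingested filenames embed a truncated SHA1:

import hashlib
import os

def assert_not_duplicate(src_path, dest_dir):
    """Abort ingest early if a file with the same SHA1 is already in the repo."""
    sha1 = hashlib.sha1()
    with open(src_path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            sha1.update(chunk)
    digest = sha1.hexdigest()[:10]  # assumption: IDs use a 10-char truncation
    for name in os.listdir(dest_dir):
        if digest in name:
            raise Exception(
                'File already in repo as %s; aborting before any changes' % name)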

unicode tester

Write a simple script (sketched below) that:

  • clones each DDR repo; then, for each repo:
  • makes a list of all the text files
  • opens each one as UTF-8
  • reports the errors.
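
A minimal sketch, assuming repos are already cloned under a common base directory and that the extensions below are the text file types worth checking:

import io
import os
import sys

# Assumption: these are the DDR text/metadata file types.
TEXT_NAMES = ('.json', '.xml', '.csv', 'changelog')

def check_repo(repo_path):
    """Try to read each text file as UTF-8; return (path, error) pairs."""
    errors = []
    for root, dirs, files in os.walk(repo_path):
        if '.git' in dirs:
            dirs.remove('.git')  # skip Git internals
        for name in files:
            if not name.endswith(TEXT_NAMES):
                continue
            path = os.path.join(root, name)
            try:
                with io.open(path, 'r', encoding='utf-8') as f:
                    f.read()
            except (IOError, UnicodeDecodeError) as err:
                errors.append((path, err))
    return errors

if __name__ == '__main__':
    # usage: python unicode_tester.py /path/to/cloned/repos
    base = sys.argv[1]
    for repo in sorted(os.listdir(base)):
        for path, err in check_repo(os.path.join(base, repo)):
            print('%s: %s' % (path, err))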

ddr-import file (ingest.py) does not function across mounted file systems

ddr-import file fails when the targeted repo is on non-local storage (i.e., an NFS mount). The error is:

OSError: [Errno 18] Invalid cross-device link

Resolution: ingest.py should use shutil.move instead of os.rename. See:

https://github.com/densho/ddr-cmdln/blob/00d6bf004a20c921f921fa5f28616ce642a51958/ddr/DDR/ingest.py#L209

(We should probably find and replace all instances of os.rename in ddr-cmdln; this has shown up in the past and was fixed in ddr-pubcopy.)
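
For reference, a minimal sketch of the difference (paths are illustrative):

import shutil

# os.rename() only works within a single filesystem; across devices it raises
# OSError: [Errno 18] Invalid cross-device link (EXDEV).
# shutil.move() attempts a rename first, then falls back to copy-and-delete.
shutil.move('/tmp/file-add/ddr-test-1-1/somefile.json',
            '/media/qnfs/ddr-test-1/files/ddr-test-1-1/somefile.json')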

Put real data in EAD/METS files

EAD.xml and METS.xml are now written with Jinja2 instead of lxml, which is good, but we're not actually putting much data into the files, which is not so good.
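
For illustration only, a minimal sketch of filling a field through a Jinja2 template (the fragment is invented; the real templates live in ddr-cmdln):

from jinja2 import Template

# Invented EAD fragment; the point is that real object data should flow
# into the template context instead of being left empty.
EAD_FRAGMENT = Template(
    '<unittitle>{{ collection.title }}</unittitle>\n'
    '<unitdate>{{ collection.dates }}</unitdate>'
)
print(EAD_FRAGMENT.render(
    collection={'title': 'Example Collection', 'dates': '1942-1945'}))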

git annex status not working with version 5

We're now using backports to get git-annex version 5, but functions that consume git annex status no longer work. That command was changed to git annex info. The output is slightly different now, as is the output for git annex version.
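
A hedged sketch of a compatibility shim for callers that just need the report text (the version check is crude and an assumption; the version-string format matches the git_version output recorded above):

import subprocess

def annex_report(repo_path):
    """Return the git-annex repository report, handling the status->info rename."""
    version = subprocess.check_output(['git', 'annex', 'version'], cwd=repo_path)
    if version.startswith(b'git-annex version: 5'):
        cmd = ['git', 'annex', 'info']    # v5: 'status' became 'info'
    else:
        cmd = ['git', 'annex', 'status']  # pre-v5 report
    return subprocess.check_output(cmd, cwd=repo_path)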

ddr-import errors

TL;DR: there is an exception when trying to import certain entities into ddr repos. The attached stacktrace shows the error I'm getting in all cases where the import fails.

I have test clones of both ddr-manz-1 and ddr-densho-1000 on the qumulo at: /media/qnfs/kinkura/temp/201706-ddrimport/test

Some entities will import successfully. I was able to import all of the interview entities in /media/qnfs/kinkura/temp/201706-ddrimport/ddr-interviews/ddr-manz-1-interviews.csv; but several of the interviews in /media/qnfs/kinkura/temp/201706-ddrimport/ddr-interviews/ddr-densho-1000-interviews.csv fail. Some do work, however; for example, ddr-densho-1000-439. See: /media/qnfs/kinkura/temp/201706-ddrimport/ddr-segments/ddr-densho-1000-interviews-test.csv

The same error occurred when trying to import the segments for ddr-manz-1-167 (/media/qnfs/kinkura/temp/201706-ddrimport/ddr-segments/ddr-manz-1-167-segments.csv).

A few things I've already tried with no difference in behavior:

  • Running ddr-import on brand-new clones of ddr-manz-1 and ddr-densho-1000
  • Running against clones on local VM storage rather than the qumulo
  • Running on a different VM -- the same set of entities failed on both kinkura and maunakea

I also tried mixing entity data that worked with entity data that did not in the same import csv file, just to make sure there wasn't anything wrong with the particular file itself. See: /media/qnfs/kinkura/temp/201706-ddrimport/ddr-segments/ddr-manz-1-167-segments-test.csv

I've examined the rows in the csvs that work (i.e., ddr-densho-1000-439 vs. ddr-densho-1000-441 in ddr-densho-1000-interviews.csv) and couldn't find anything suspicious such as high-bit chars, extra columns, unterminated escapes, etc.

At this point, I'm not sure what else to try. Looks like it might have something to do with the identifier.py code that determines whether the entity already exists; but I'm not certain.

(ddrlocal) ddr@maunakea:/media/qnfs/kinkura/temp/201706-ddrimport/test$ ddr-import entity ../ddr-segments/ddr-manz-1-167-segments-test.csv ./ddr-manz-1
2017-06-23 11:03:43,647 DEBUG    <DDR.identifier.Identifier collection:ddr-manz-1>
2017-06-23 11:03:43,648 DEBUG    /media/qnfs/kinkura/temp/201706-ddrimport/test/ddr-manz-1
2017-06-23 11:03:43,648 INFO     ------------------------------------------------------------------------
2017-06-23 11:03:43,648 INFO     batch import entity
2017-06-23 11:03:43,651 INFO     <git.Repo "/media/qnfs/kinkura/temp/201706-ddrimport/test/ddr-manz-1/.git">
2017-06-23 11:03:43,652 INFO     Reading /media/qnfs/kinkura/temp/201706-ddrimport/ddr-segments/ddr-manz-1-167-segments-test.csv
2017-06-23 11:03:43,655 INFO     16 rows
2017-06-23 11:03:43,656 INFO     - - - - - - - - - - - - - - - - - - - - - - - -
2017-06-23 11:03:43,656 INFO     Importing
2017-06-23 11:03:43,656 INFO     1/16 - ddr-manz-1-158-1
2017-06-23 11:03:43,754 DEBUG    | <DDR.identifier.Identifier segment:ddr-manz-1-158-1> (0:00:00.097908)
2017-06-23 11:03:43,755 INFO     2/16 - ddr-manz-1-167-12
Traceback (most recent call last):
  File "/usr/local/src/ddr-local/venv/ddrlocal/bin/ddr-import", line 4, in <module>
    __import__('pkg_resources').run_script('ddr-cmdln==0.9.4b0', 'ddr-import')
  File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1500, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/src/ddr-local/venv/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-import", line 228, in <module>
    main()
  File "/usr/local/src/ddr-local/venv/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-import", line 190, in main
    args.dryrun
  File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/batch.py", line 525, in import_entities
    entity = eidentifier.object()
  File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/identifier.py", line 1047, in object
    return self.object_class(mappings).from_identifier(self)
  File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 1351, in from_identifier
    return from_json(Entity, identifier.path_abs('json'), identifier)
  File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 366, in from_json
    document.load_json(fileio.read_text(json_path))
  File "/usr/local/src/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/fileio.py", line 16, in read_text
    raise IOError('File is missing or unreadable: %s' % path)
IOError: File is missing or unreadable: /media/qnfs/kinkura/temp/201706-ddrimport/test/ddr-manz-1/files/ddr-manz-1-167/files/ddr-manz-1-167-12/entity.json
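
From the traceback, import_entities() calls eidentifier.object(), which assumes the entity's entity.json already exists on disk; rows for entities that don't exist yet would hit exactly this IOError, which may be what's happening here. A hedged sketch of a guard (Entity.new is hypothetical, named only for illustration):

import os

from DDR.models import Entity

def load_or_create(eidentifier):
    # path_abs('json') appears in the traceback above; the branch is the new part.
    json_path = eidentifier.path_abs('json')
    if os.path.exists(json_path):
        return eidentifier.object()   # update an existing entity
    return Entity.new(eidentifier)    # hypothetical helper: create an empty entity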

git-status reporting repo is ahead when it appears to be synced

We are seeing this after having run ddr sync which executes DDR.commands.sync():

densho@kinkura:/media/qnfs/kinkura/gold/ddr-densho-299$ git status
# On branch master
# Your branch is ahead of 'origin/master' by 539 commits.
#   (use "git push" to publish your local commits)
#

The repository on the Gitolite server seems to contain the same commits as reported by git log:
http://partner.densho.org/cgit/cgit.cgi/ddr-densho-299

Everything is fine after manually executing the sequence of commands in the function:

ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git checkout master
Already on 'master'
Your branch is ahead of 'origin/master' by 3 commits.
  (use "git push" to publish your local commits)
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git pull origin master
From mits.densho.org:ddr-testing-259
 * branch            master     -> FETCH_HEAD
   d5cf973..741f157  master     -> origin/master
Already up-to-date.
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git checkout git-annex 
Switched to branch 'git-annex'
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git pull origin git-annex
From mits.densho.org:ddr-testing-259
 * branch            git-annex  -> FETCH_HEAD
Already up-to-date.
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git checkout git-annex 
Already on 'git-annex'
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git push origin git-annex
Everything up-to-date
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git push origin master
Everything up-to-date
ddr@kinkura:/media/qnfs/kinkura/gold/ddr-testing-259$ git status
# On branch master
# Your branch is up-to-date with 'origin/master'.
#
nothing to commit, working directory clean

record_created set to wrong value

philip @ 7/14/2017 9:54AM

The [Record Created] field for ddr-testing-40003 says it was created Thursday 13 July 2017, 03:15 PM PDT - which I believe to be correct. However the [Record Created] field for object ddr-testing-40003-1 says it was created Thursday 13 July 2017, 01:48 PM PDT - which is not possible - I was either at lunch or had just gotten back and was not working in the testing repos anyway. Not sure how problematic this potentially could be, just something of note.

geoffJ @ 7/14/2017 10:33AM

Second odd thing - and I am not sure that this is related. [Repeats philip's report above about the ddr-testing-40003 / ddr-testing-40003-1 Record Created discrepancy.]

Before we launch into conspiracy theories, check the datetime on that VM; it's probably off. Whenever you close a VM, saving the state, and then wake it up again, you'll get clock drift. In my dev environment I always restart NTP after waking up a VM ("sudo service ntp restart") or else all my Git commit timestamps are off. At the Densho HQ it's probably best to shut down VMs rather than saving state. Also, the host machine clock could be off.

cameron @ 7/14/2017 11:01AM

Update on the time error:

Philip and I noticed that in multiple collections, the [Record Creation] timestamp in the editor is subtracting a fixed amount of time from the initial time that the entire collection was created and then applying that resulting timestamp globally across all of the collection's objects.

A formulaic depiction of the error is as follows:

(Valid Collection Creation Time) - n = Globally applied object [Record Creation] value.

  • Philip and I have looked at collections within ddr-testing and ddr-densho and have seen this popping up in numerous collections' objects, although the value of n differs across certain date-ranges of the collections' creation.
  • Time drift has ranged from a few minutes to multiple days.
  • Also: we checked the VM clocks and they were valid.

gjost @ 11:20am

Maybe you're looking at collections that were imported into the DDR system from CSV files (I don't have a list but Senor Froh probably does). You'd see a collection created at one time and then large numbers of entities or files all created at the same time.

gjost @ 11:31am

Were ddr-testing-40003 and ddr-testing-40003-1 created on/from the same machine? Your setup there is for everything to be stored on the Qumulo but the JSON files (and the timestamps) come from whatever VM is writing them.

One possibility is that the background process (celery) that actually writes and commits most files might be lagging, but I wouldn't expect it to lag by more than a minute or two.

Also, there's a changelog for every entity in the entity's folder (the file is called "changelog"). Every modification that's done through the app should be recorded there*.

  • Manual modifications (i.e. using a text editor) are usually not recorded unless somebody also changes the changelog. Batch operations may also neglect to update the changelog.

philip @ 11:31am

Maybe you're looking at collections that were imported into the DDR system from CSV files (I don't have a list but Senor Froh probably does). You'd see a collection created at one time and then large numbers of entities or files all created at the same time.

This might possibly exist in older collections, but I am seeing it in collections which I know were not imported from a CSV: ddr-densho-332, which was made in May of 2017, and ddr-testing-40003, which was made yesterday.
ddr-densho-330 (which was made in Feb and was worked on from March through April) also has timestamp differences, though the differences are not the same for every entity (ranging from two minutes to 50+ hours).

froh @ 11:44am

There are definitely a large number of collections -- especially the earlier ddr-densho repos, most of the ddr-njpa, and a bunch of others -- that were originally created from import. The collection repo itself would have been created first -- i.e., collection.json -- then the entities would have been imported from CSV.

Can you give me a sample list of a couple of the collections where you see this behavior?

And just to clarify, this isn't happening when you create new objects through the webui in a new collection, right?

philip @ 11:45am

Were ddr-testing-40003 and ddr-testing-40003-1 created on/from the same machine? Your setup there is for everything to be stored on the Qumulo but the JSON files (and the timestamps) come from whatever VM is writing them.

Yes, they were made on the same machine and at this time all the machines are running the same VM appliance.

One possibility is that the background process (celery) that actually writes and commits most files might be lagging, but I wouldn't expect it to lag by more than a minute or two.
Also, there's a changelog for every entity in the entity's folder (the file is called "changelog"). Every modification that's done through the app should be recorded there*.

Looking at the entity changelog for ddr-testing-40003-1, it has the same initialization time that the collection changelog has: Thu, 13 July 2017, 03:18 PM PDT. The [Record Created] field in Editor says Thu, 13 July 2017, 01:48 PM PDT.

  • Manual modifications (i.e. using a text editor) are usually not recorded unless somebody also changes the changelog. Batch operations may also neglect to update the changelog.

Also, looking at older collection/entity changelogs, the initialization/update time/date format is different.
On newer collections it is (example from ddr-testing-40003-1 entity initialization): Thu, 13 July 2017, 03:18 PM PDT
On older collections it is (example from ddr-densho-303 collection initialization): Mon, 22 Jun 2017 11:40:50
(Note that the older format is on a 24-hour clock.)

philip @ 11:52am

Can you give me a sample list of a couple of the collections where you see this behavior?

As in collections imported from a CSV exhibiting differences in timestamps? Or just collections with differences in timestamps? If you mean the latter: ddr-densho-303, ddr-densho-330, ddr-densho-332, ddr-densho-325, and all of the ddr-testing collections so far.

And just to clarify, this isn't happening when you create new objects through the webui in a new collection, right?

this IS happening with new objects in new collections through the webui

file delete not completing

pkikawa 2017-08-28 11:17

Error ID:[EID_20170828_1]
Status: [ToDo]
Local Machine: [see comment]
Full Collection+Object ID Error Path: []
Time and Date of Error: [recurring error since at least Aug 1st, most recently Aug 28 at 11:00]
Brief Error Description: [deleting binary files from the webui results in a failure notification, but the files get deleted in the actual repo. Running git status reveals the files have been deleted, but changes to the object entity.json are not staged for commit.]
Steps to Replicate Error:[delete any master/mezz from the web ui (tested and repeatable in testing collections), error messages same]

ddr sync not pushing commits to mits remote correctly

ddr sync is pushing local changes to synced/master on mits, but not to master as expected.

Test routine was as follows:

  1. I made a simple change to the text of collection.json, then manually committed.
ddr@kinkura:/ddr-testing-247$ nano collection.json
ddr@kinkura:/ddr-testing-247$ git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#       modified:   collection.json
#
no changes added to commit (use "git add" and/or "git commit -a")
ddr@kinkura:/ddr-testing-247$ git add collection.json
ddr@kinkura:/ddr-testing-247$ git commit -m"Manual commit of test changes for ddr sync testing."
[master 9e16e03] Manual commit of test changes for ddr sync testing.
 1 file changed, 1 insertion(+), 1 deletion(-)
  2. git status shows as one commit ahead of remote.
ddr@kinkura:/ddr-testing-247$ git status
# On branch master
# Your branch is ahead of 'origin/master' by 1 commit.
#
nothing to commit (working directory clean)
  3. Ran ddr sync
ddr@kinkura:/usr/local/src/ddr-local/ddrlocal$ ddr sync -u DDRAdmin -m [email protected] -c /ddr-testing-247
  4. Went back to local repo dir and checked status. Shows one commit ahead still. I tried ddr sync'ing twice for good measure with the same result. cgit on mits shows synced/master branch ahead of master.
ddr@kinkura:/usr/local/src/ddr-local/ddrlocal$ cd /ddr-testing-247
ddr@kinkura:/ddr-testing-247$ git status
# On branch master
# Your branch is ahead of 'origin/master' by 1 commit.
#
nothing to commit (working directory clean)
  5. Then tried a manual push from local. The commit is pushed to master on mits and cgit shows synced/master and master are the same.
ddr@kinkura:/ddr-testing-247$ git push
Total 0 (delta 0), reused 0 (delta 0)
To [email protected]:ddr-testing-247.git
   6b915bd..9e16e03  master -> master
ddr@kinkura:/ddr-testing-247$ git status
# On branch master
nothing to commit (working directory clean)

Notes:

  • Script returns no error.
  • Running 'git push' manually from local will push changes to remote master.

More sophisticated logic for generating signature images (ddrindex)

Currently, ddrindex generates signature images for Entities by choosing the access binary for the first mezzanine File. The Collection signature image is the access binary for the first mezzanine of the first Entity in the Collection. However, Files are ordered under their parent Entity using the 'sort' attribute in the File json.

ddrindex should use the order of the Files as determined by their 'sort' values to select the signature image, rather than simply selecting the first mezzanine file image in the filesystem sort order.
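
A minimal sketch of the proposed selection, assuming File objects expose the role and sort attributes described above:

def pick_signature(files):
    """Pick the signature File: the mezzanine with the lowest 'sort' value,
    rather than the first mezzanine in filesystem order."""
    mezzanines = [f for f in files if f.role == 'mezzanine']
    if not mezzanines:
        return None
    return min(mezzanines, key=lambda f: f.sort)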

ddr-filter and ddr-pubcopy broken: "Exception('Could not import identifier definitions.')"

ddr-filter command throws an exception -- "Could not import identifier definitions."

  • Confirmed on two VMs (dragontail and kinkura at HQ).
  • Running as ddr on latest master for ddr-local and ddr-cmdln
  • ddr-defs is up-to-date and present; /etc/ddr/configs are valid
  • Workbench UI is functioning
(ddrlocal)ddr@kinkura:/media/qnfs/kinkura/gold/ddr-densho-284$ ddr-filter -ma -s /media/qnfs/kinkura/gold/ddr-densho-284 -d /media/qnfs/kinkura/working
Traceback (most recent call last):
  File "/usr/local/src/env/ddrlocal/bin/ddr-filter", line 4, in <module>
    __import__('pkg_resources').run_script('ddr-cmdln==0.9.4b0', 'ddr-filter')
  File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1504, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/src/env/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-filter", line 72, in <module>
    from DDR import identifier
  File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/identifier.py", line 15, in <module>
    raise Exception('Could not import Identifier definitions!')
Exception: Could not import Identifier definitions!

Incorrect parsing of entity.topics

GFroh 2017-07-10 14:47
[It looks] like the topic term for Hawai'i is not being properly inserted into the data itself. It is being rendered as:

{
    "id": "277",
    "term": "i"
},

The other behavior I've just noticed is that the editor seems to be munging the "topics" dict when it reloads the data. Here's an example of the diff from ddr-pc-33/files/ddr-pc-33-15/entity.json

<<<<<<< HEAD
"id": "Community publications: Pacific Citizen:389",
"term": "Journalism and media"
=======
"id": "389",
"term": "Journalism and media: Community publications: Pacific Citizen"
>>>>>>> 26b61ec199b3e3e9ffa189caa18a2c795f8756e9

The version below the ======= marker (b) is the original data; the HEAD version (a) is what the editor is doing to existing topic data.

repos that have been ddr sync'ed are still ahead of remote on mits

After running the ddr sync command (directly from the commandline or invoked by the ddr-local webui), local collection repos still report being ahead of the origin remote on mits. However, the remote on mits does appear to be receiving local changes.

  • ddr sync runs without errors
  • git status shows the repo as ahead
  • origin/master on mits has the same commit hash for HEAD pointed at master; file content matches
  • repo status can be "fixed" by manually invoking: git checkout master; git pull origin; git checkout git-annex; git pull origin; git push origin; git checkout master; git push origin

For more debug detail, see: https://docs.google.com/document/d/1cPDNWvSDXhmM4kK2ujU6nOknViKtvFKnPUsBacANT_M/

progress bars API

It would be great to be able to display progress bars for long-running processes, especially batch processes:

  • file add
  • batch import
  • batch export
  • batch update
  • ???

To clarify, we want to be able to use proc A (e.g. the web UI) to check the progress of proc B (celery, ddr-cmdln).
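
A minimal sketch of one possible mechanism, assuming progress is shared through a JSON status file both processes can reach (the real implementation might use the Celery result backend instead):

import json
import time

def write_progress(status_path, task, current, total):
    """Called by the worker (proc B) after each unit of work."""
    with open(status_path, 'w') as f:
        json.dump({'task': task, 'current': current, 'total': total,
                   'updated': time.time()}, f)

def read_progress(status_path):
    """Called by the UI (proc A) to render a progress bar."""
    with open(status_path) as f:
        return json.load(f)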

Discrepancies in local vs. known annex keys/size

< geoff.froh at densho.org > 2015-05-19 10:50:

Here's some additional followup on the odd git annex status behavior we were discussing yesterday. To summarize: in the group of ddr-njpa repos that I'm prepping for publication (ddr-njpa-1, ddr-njpa-2, ddr-njpa-4, ddr-njpa-6, ddr-njpa-8), when running git annex status there is a discrepancy between "local annex keys"/"local annex size" and "known annex keys"/"known annex size". The "local" values are always lower, at approximately 50% of the "known" values. However, git annex whereis reports that all of the content that should exist is present in the local annex ("here"); in addition, git annex fsck is clean and git annex unused does not contain a large amount of data.

Sample output:

ddr@kinkura:/media/qnfs/kinkura/gold/ddr-njpa-1$ git annex status
...
local annex keys: 8764
local annex size: 17 gigabytes
known annex keys: 17530
known annex size: 35 gigabytes
...
git annex fsck: clean
git annex whereis: all OK; 1 copy here (qnfs)

Yesterday afternoon, I deleted the existing ddr-njpa-8 repo, and did a fresh recreate and reimport. For the recreate, I used the ddr-cmdln 'ddr create' command-line tool; the reimport used the ddr-local migration library on the current master branch. Both operations were using a fresh VM (deb 7.8) on the laptop dragontail connected to the qumulo.

I ran ddrfilter and ddrpubcopy on the newly created repo and both worked as expected, producing the expected output data. However, git annex status shows exactly the same discrepancy as the other existing ones (i.e., "local" is almost exactly half of "known").

At this point, I'm going to do a larger sample of the other repos in gold to see if this behavior is more widespread.

Again, not sure if this really matters; although I've used git annex status as one way -- along with checking the output from the git annex get op, of course -- to verify that I've got all of a partner's data copied both at the remote site when prepping the drive and at HQ when pulling data into the gold repos on the qumulo.

file.json needs explicit 'role' attribute

The 'role' attribute (whether a file is master, mezzanine, or some other yet-to-be-determined value) should be explicitly recorded in an attribute in file.json rather than inferred from the name of the binary file itself.

This is a change that will require a mass update of all file jsons (sketched below).
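
A minimal sketch of such a mass update, assuming the role can be recovered from the filename and that file.json is a flat dict (the real layout may differ):

import json
import re

ROLE_PATTERN = re.compile(r'-(master|mezzanine)-[0-9a-f]+\.json$')

def add_role(json_path):
    """Backfill an explicit 'role' attribute inferred from the filename."""
    m = ROLE_PATTERN.search(json_path)
    if not m:
        return
    with open(json_path) as f:
        data = json.load(f)
    data['role'] = m.group(1)  # assumption: flat dict layout
    with open(json_path, 'w') as f:
        json.dump(data, f, indent=4, sort_keys=True)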

ddr-export inserts extra newline chars in entity 'topics' value

When exporting entities and there is more than one value present in the 'topics' attribute, ddr-export inserts a spurious \n char for each subsequent KV pair in the list. Example:

"term:Military service: Veterans' organizations|id:19;\nterm:World War II: Military service|id:88;\nterm:World War II: Military service: 100th Infantry Battalion|id:421",

codes and usernames in git commits

Requirements:

  • There are PREMIS requirements for tracking various significant operations on Repository objects.
  • We want to track which users perform which actions.

Problems:

  • Users (possibly multiple users) in the host Windows OS will interact with the ddr-local web app, which in turn makes calls to ddr-cmdln. Inside its guest VM, ddr-local will always be running as the same user (ddrlocal? django?). Git will thus always be making commits as this user.
  • Interactions with the sandbox Gitolite server will not use pubkeys for individual users; rather they will use the VM's id_rsa.pub.
  • There may be a git commit every time the user edits a mets.xml or ead.xml. It might become necessary to separate more significant events from the noise of text edits.

Possible solution:

  • Include codes in commit message titles so they can be filtered and searched.
  • Include username in the commit title or body.
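
A minimal sketch of what such a convention might look like (the code list and format are invented for illustration):

# Illustrative operation codes, not an adopted list.
CODES = ('SYNC', 'INGEST', 'EDIT', 'DELETE')

def commit_message(code, username, summary):
    """Build a commit title that can be filtered with e.g. 'git log --grep=^SYNC'."""
    assert code in CODES
    return '%s [%s] %s' % (code, username, summary)

# commit_message('SYNC', 'gjost', 'Synced collection')
# -> 'SYNC [gjost] Synced collection'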

ddrfilter not moving all file content

< geoff.froh at densho.org > 09:39

Having issues running ddrfilter against a number of the ddr-njpa repos in the batch for publication. TL;DR: the command does not seem to be moving all the file content when cloning the filter repo.

The two repos I've had this problem with are ddr-njpa-1 and ddr-njpa-4. I've run ddrfilter from two separate VMs on two different physical hosts (the high-power laptop where I've run these ops before, and kinkura on the VM host).

ddr-njpa-1 has 17530 files total in the repo. Requesting only access files with ddrfilter -a produces 4777 files. I dug into the ddrfilter code and found where the file list for git annex get is created (https://github.com/densho/ddr-cmdln/blob/abbb4b337ee59d15bb964fc35d6cc1856431297b/ddr/bin/ddrfilter#L128). Running git annex manually against the target repo using the same file pattern ('git annex whereis -a.jpg') produces a list with 8765 access files. The filter log, as well as the output from annex whereis for both the original repo and the filter repo are attached.

I've gone into both the original and filter repos to compare the metadata in file/entity jsons for files that were transferred vs. those that weren't. So far, I haven't found anything different -- i.e., status, privacy, etc.

FWIW, the command did work as expected on ddr-njpa-8. That repo is substantially smaller in both number of files and size than either ddr-njpa-1 or ddr-njpa-4. It might be that these repos are the largest, in terms of number of files, that we have run through ddrfilter thus far -- maybe that's a factor? Not sure. (On a side note, I don't think this is related to the duplicate binary thing we discussed in relation to the git annex status output -- ddr-njpa-8 also has the same original file used in both the master and mezzanine roles.)

For now, would you take a look at the logs and reexamine the code for anything you think might be causing this? I'm going to try manually adding the original repo annex as a remote to the filter repo and git annex getting the missing content.

Identifier object

We currently have a very complex system of object identifiers: object IDs, filesystem paths, and URLs.

These identifiers are used to do things such as:

  • uniquely identify an object
  • identify object type (model)
  • find object parents
  • locate object JSON in filesystem

Often these identifiers are parsed into parts: model, repo, org, cid, eid, role, sha1.
This will be a problem when we need to add layers.

Identifier object.

  • Can take anything as input: object ID, relative or absolute path, URL, parts.
  • Return any desired form of the ID: object ID, relative or absolute path, URL, parts.
  • Parse input using regex or something like Django URL routing. This config will live in data in the ddr-config repo.
  • Objects will create an Identifier as part of initialization.
  • Functions will take Identifiers as args instead of OIDs, paths, URLs, or parts.
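
A minimal sketch of the idea (the ID grammar below is a simplification; per the proposal, the real patterns would live as data in the ddr-config repo):

import re

# Simplified grammar covering collection/entity/segment/file IDs, e.g.
# 'ddr-densho-1000', 'ddr-manz-1-158-1', 'ddr-testing-345-1-master-77ec5e4008'.
ID_PATTERN = re.compile(
    r'^(?P<repo>[a-z]+)-(?P<org>[a-z0-9]+)-(?P<cid>\d+)'
    r'(?:-(?P<eid>\d+))?(?:-(?P<sid>\d+))?'
    r'(?:-(?P<role>[a-z]+)-(?P<sha1>[0-9a-f]+))?$'
)

class Identifier(object):
    def __init__(self, object_id):
        match = ID_PATTERN.match(object_id)
        if not match:
            raise ValueError('Cannot parse identifier: %s' % object_id)
        self.id = object_id
        self.parts = match.groupdict()

    def model(self):
        if self.parts['sha1']: return 'file'
        if self.parts['sid']: return 'segment'
        if self.parts['eid']: return 'entity'
        return 'collection'

    def collection_id(self):
        return '-'.join([self.parts['repo'], self.parts['org'], self.parts['cid']])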

Can't add new access files.

Adding a new file works:

>>> from django.conf import settings
>>> from webui import identifier
>>> git_name = 'gjost'
>>> git_mail = '[email protected]'
>>> entity = identifier.Identifier('ddr-testing-345-1').object()
>>> src_path = '/tmp/bernie-sanders-socialism-scare.gif'
>>> role = 'mezzanine'
>>> data = {
...  'label': u'SOCIALISM!',
...  'path': u'/tmp/bernie-sanders-socialism-scare.gif',
...  'public': u'1',
...  'rights': u'cc',
...  'sort': 1,
... }
>>> agent = 'cmdln'
>>> file_,repo,log = entity.add_file(src_path, role, data, git_name, git_mail, agent)
>>> file_,repo,log = entity.add_file_commit(file_, repo, log, git_name, git_mail, agent)
>>> result = file_.post_json(settings.DOCSTORE_HOSTS, settings.DOCSTORE_INDEX)
No handlers could be found for logger "elasticsearch.trace"

But adding a new access file does not:

>>> from django.conf import settings
>>> from webui import identifier
>>> git_name = 'gjost'
>>> git_mail = '[email protected]'
>>> entity = identifier.Identifier('ddr-testing-345-1').object()
>>> ddrfile = identifier.Identifier('ddr-testing-345-1-master-77ec5e4008').object()
>>> src_path = u'files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008.gif'
>>> agent = 'cmdln'
>>> file_,repo,log = entity.add_access(ddrfile, git_name, git_mail, agent)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    file_,repo,log = entity.add_access(ddrfile, git_name, git_mail, agent)
  File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 1252, in add_access
    return ingest.add_access(self, ddrfile, git_name, git_mail, agent='')
  File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/ingest.py", line 542, in add_access
    repo = stage_files(entity, git_files, annex_files, new_files, log, show_staged=show_staged)
  File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/ingest.py", line 283, in stage_files
    log.crash('Add file aborted, see log file for details: %s' % log.logpath)
  File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/ingest.py", line 46, in crash
    raise Exception(msg)
Exception: Add file aborted, see log file for details: /var/log/ddr/addfile/ddr-testing-345/ddr-testing-345-1.log

The operation fails, leaving FILE.json modified and staged but not committed. (In the log below, the convert command fails, so the access JPEG is never created; staging then finds only 1 new file instead of the expected 2 and aborts.)

Here's the addfile.log

[2016-06-17T11:36:20.547080] ok - ------------------------------------------------------------------------
[2016-06-17T11:36:20.547210] ok - DDR.models.Entity.add_access: START
[2016-06-17T11:36:20.547271] ok - entity: ddr-testing-345-1
[2016-06-17T11:36:20.547461] ok - ddrfile: <webui.models.DDRFile 'ddr-testing-345-1-master-77ec5e4008'>
[2016-06-17T11:36:20.547513] ok - Checking files/dirs
[2016-06-17T11:36:20.547567] ok - check dir /var/www/media/ddr/ddr-testing-345/files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008.gif (| src_path)
[2016-06-17T11:36:20.547669] ok - Identifier
[2016-06-17T11:36:20.547715] ok - | file_id ddr-testing-345-1-master-77ec5e4008
[2016-06-17T11:36:20.547756] ok - | basepath /var/www/media/ddr
[2016-06-17T11:36:20.548498] ok - | identifier <DDR.identifier.Identifier file:ddr-testing-345-1-master-77ec5e4008>
[2016-06-17T11:36:20.549069] ok - Checking files/dirs
[2016-06-17T11:36:20.549172] ok - check dir /var/www/media/ddr/tmp/file-add/ddr-testing-345/ddr-testing-345-1 (| tmp_dir)
[2016-06-17T11:36:20.549260] ok - check dir /var/www/media/ddr/ddr-testing-345/files/ddr-testing-345-1/files (| dest_dir)
[2016-06-17T11:36:20.549366] ok - Making access file
[2016-06-17T11:36:20.549451] ok - | /var/www/media/ddr/tmp/file-add/ddr-testing-345/ddr-testing-345-1/ddr-testing-345-1-master-77ec5e4008-a.jpg
[2016-06-17T11:36:20.590559] not ok - Traceback (most recent call last):
  File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/ingest.py", line 155, in make_access_file
    geometry=config.ACCESS_FILE_GEOMETRY
  File "/usr/local/src/env/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/imaging.py", line 117, in thumbnail
    out = subprocess.check_output(cmd, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command 'convert /var/www/media/ddr/ddr-testing-345/files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008.gif[0] -resize '1024x1024>' /var/www/media/ddr/tmp/file-add/ddr-testing-345/ddr-testing-345-1/ddr-testing-345-1-master-77ec5e4008-a.jpg' returned non-zero exit status 1
[2016-06-17T11:36:20.590728] ok - File object
[2016-06-17T11:36:20.590851] ok - | file_ <webui.models.DDRFile 'ddr-testing-345-1-master-77ec5e4008'>
[2016-06-17T11:36:20.590943] ok - Writing object metadata
[2016-06-17T11:36:20.591039] ok - | /var/www/media/ddr/tmp/file-add/ddr-testing-345/ddr-testing-345-1/ddr-testing-345-1-master-77ec5e4008.json
[2016-06-17T11:36:20.591940] ok - Moving files to dest_dir
[2016-06-17T11:36:20.591998] ok - | all files moved
[2016-06-17T11:36:20.592037] ok - Moving file .json to dest_dir
[2016-06-17T11:36:20.592125] ok - | mv /var/www/media/ddr/tmp/file-add/ddr-testing-345/ddr-testing-345-1/ddr-testing-345-1-master-77ec5e4008.json /var/www/media/ddr/ddr-testing-345/files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008.json
[2016-06-17T11:36:20.592397] ok - | all files moved
[2016-06-17T11:36:20.592529] ok - Staging files
[2016-06-17T11:36:20.593645] ok - | repo <git.Repo "/var/www/media/ddr/ddr-testing-345/.git">
[2016-06-17T11:36:20.603926] ok - | 2 files to stage:
[2016-06-17T11:36:20.603996] ok - |   files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008.json
[2016-06-17T11:36:20.604033] ok - |   files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008-a.jpg
[2016-06-17T11:36:20.604087] ok - git stage
[2016-06-17T11:36:20.610856] ok - annex stage
[2016-06-17T11:36:20.630233] ok - ok
[2016-06-17T11:36:20.636601] ok - | 1 files staged:
[2016-06-17T11:36:20.636658] ok - show_staged True
[2016-06-17T11:36:20.636684] ok - |   files/ddr-testing-345-1/files/ddr-testing-345-1-master-77ec5e4008.json
[2016-06-17T11:36:20.636721] not ok - 1 new files staged (should be 2)
[2016-06-17T11:36:20.636743] not ok - File staging aborted. Cleaning up
[2016-06-17T11:36:20.636774] not ok - finished cleanup. good luck...
[2016-06-17T11:36:20.636812] not ok - Add file aborted, see log file for details: /var/log/ddr/addfile/ddr-testing-345/ddr-testing-345-1.log

ddr-local: 5bf6e4e794
ddr-cmdln: cce0071757

set meaningful lock notes, and display them

When setting a lock, write a meaningful explanation to the lockfile, e.g.:
"ddr sync"
Display the contents of the lockfile when an operation can't continue because of the lock:
Cannot continue: collection is locked ("ddr sync")
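
A minimal sketch, assuming the lock is a plain file in the collection directory (names are illustrative):

import os

def set_lock(collection_path, note):
    """Record why the collection is locked, e.g. note='ddr sync'."""
    with open(os.path.join(collection_path, 'lock'), 'w') as f:
        f.write(note)

def lock_note(collection_path):
    """Return the lock explanation, or None if the collection is unlocked."""
    path = os.path.join(collection_path, 'lock')
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read().strip()

# On a blocked operation:
#   raise Exception('Cannot continue: collection is locked ("%s")' % lock_note(path))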

Deal with interruptions of Git processes

Long-running Git sync operations may be interrupted by users who don't realize they can't just shut off the machine or close down the VM. Rather than hoping that this doesn't happen, or trying to prevent users doing something stupid, assume that it will happen at some point and figure out how to deal with it when it happens.

Ideally, operations should be transactional. If an operation fails, we don't want to be left with:

  • untracked files
  • staged but uncommitted files.

We DO want to log everything so we can figure out what happened if there's a problem. It would probably be good to be able to turn off cleanup.
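
A hedged sketch of what a transactional wrapper could look like with GitPython (the rollback strategy is an assumption, and cleanup is switchable so evidence can be preserved for debugging):

import logging

from git import Repo  # GitPython, already used by ddr-cmdln

def run_transactional(repo_path, operation, cleanup=True):
    """Run operation(); on failure, optionally restore the repo's prior state."""
    repo = Repo(repo_path)
    head = repo.head.commit.hexsha
    try:
        operation()
    except Exception:
        logging.exception('operation failed; HEAD was %s', head)
        if cleanup:
            repo.git.reset('--hard', head)  # drop staged-but-uncommitted changes
            repo.git.clean('-fd')           # remove untracked files
        raise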
