artefactual / archivematica-storage-service
Archivematica storage service
Home Page: http://www.archivematica.org
License: GNU Affero General Public License v3.0
To reproduce:
Result:
The second replication will be created from the first replication, rather than from the original AIP. If you look at the pointer files for the original and replicated AIPs:
Instead, the original AIP pointer file should have two replication events and two corresponding validation events.
Merge work from JiscSD#8 into core.
This provides a default location setting in the storage service.
Related AM issue: artefactual/archivematica#669
To recreate: Go to admin tab of SS and see that "Home" is highlighted.
The storage service contains pervasive calls to filesystem APIs (os, sys, shutil) and database APIs (Django models, Elasticsearch). If classes like package.py::Package explicitly declared dependencies on objects (say, fs or db) and the APIs they required, then those dependencies could be swapped out more easily in different environments, e.g., during testing or when digital objects are accessed not from a Unix filesystem but from an S3 store.
One immediate benefit of such a refactoring would be much faster unit and functional tests: if, instead of creating test databases and test directory structures, we swapped in mock dependencies, the runtime of these tests could probably be reduced significantly.
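A minimal sketch of the idea (all names here, such as Package, fs, db, FakeFS, and FakeDB, are hypothetical and not the actual SS API): the class asks for its collaborators instead of importing them, and tests hand it cheap fakes:

```python
# Illustrative sketch only: "Package", "fs", and "db" are hypothetical
# names, not the real storage service API.
class Package:
    """A package that receives its filesystem and database collaborators
    instead of importing os/shutil/Django models directly."""

    def __init__(self, uuid, fs, db):
        self.uuid = uuid
        self.fs = fs  # anything exposing exists()/delete()
        self.db = db  # anything exposing save()

    def delete_from_storage(self, path):
        if self.fs.exists(path):
            self.fs.delete(path)
        self.db.save({"uuid": self.uuid, "status": "DELETED"})


class FakeFS:
    """Test double standing in for the real filesystem."""

    def __init__(self, files):
        self.files = set(files)

    def exists(self, path):
        return path in self.files

    def delete(self, path):
        self.files.discard(path)


class FakeDB:
    """Test double standing in for the database layer."""

    def __init__(self):
        self.saved = []

    def save(self, record):
        self.saved.append(record)
```

A test could then exercise Package.delete_from_storage without touching disk or a database, which is where the speedup would come from.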
The metsrw PR 27 illustrates a strategy for dependency injection that might be applicable here.
Relatedly, this strategy could allow us to clean up the space-related code and make spaces pluggable dependencies, which could allow users to avoid installing unneeded third party dependencies for space types that they never use.
Refactoring SS to include dependency injection could proceed via the Strangler Application pattern.
I've learned today that in the CentOS installation instructions we ask the user to run the following command:
$ sudo ln -sf /usr/bin/7za /usr/bin/7z
SS doesn't implement a mechanism to fall back to 7za, so if this step is missed the application will fail. artefactual/fixity#11 is an example of how this affected one of our users.
I'm not sure if this should be considered a priority, but it wouldn't be hard to solve.
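A fallback could be as simple as probing PATH for either binary. Sketch below (the helper name is ours; shutil.which requires Python 3, so on the Python 2 codebase of that era distutils.spawn.find_executable would play the same role):

```python
import shutil


def find_first_executable(candidates):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


# SS could call this instead of hard-coding "7z":
SEVEN_ZIP = find_first_executable(("7z", "7za"))
```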
SS stores the remote_name of a pipeline when it's created, for subsequent API access (e.g. re-ingest):
remote_name = models.CharField(
max_length=256,
default=None,
null=True,
blank=True,
help_text="Host or IP address of the pipeline server for making API calls.")
This field is also editable from the web interface.
When a pipeline is created via the SS API, this field is populated from the REMOTE_ADDR header unless the client provides a value via the remote_name property.
The problem is that the dashboard doesn't allow users to provide a custom value, so SS always falls back to the value found in the REMOTE_ADDR header, which is problematic under some circumstances.
When displaying the packages/ location, the Storage Service reads all the entries of the packages table into memory before rendering ( https://github.com/artefactual/archivematica-storage-service/blob/stable/0.10.x/storage_service/locations/views.py#L45-L47 ), even though the view shows only 10 packages at a time. On an instance with 40,000+ packages it takes several minutes to display. There is also a risk that it will eventually hit out-of-memory errors and become unable to display the page at all.
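The fix is to let the database do the limiting: in Django that usually means slicing the queryset (which translates to SQL LIMIT/OFFSET) instead of materializing every row. The idea, sketched without Django (LazyPackages is a made-up stand-in for a queryset):

```python
class LazyPackages:
    """Stand-in for a Django queryset: counting and slicing are cheap,
    iterating the whole table is what we want to avoid."""

    def __init__(self, total):
        self.total = total
        self.rows_fetched = 0  # tracks how many rows actually hit memory

    def count(self):
        return self.total  # SELECT COUNT(*) -- no rows transferred

    def page(self, number, per_page=10):
        start = (number - 1) * per_page
        end = min(start + per_page, self.total)
        self.rows_fetched += end - start  # only this page is loaded
        return ["package-%d" % i for i in range(start, end)]


packages = LazyPackages(total=40000)
first_page = packages.page(1)  # 10 rows in memory, not 40000
```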
Merge work from JiscSD#3 and JiscSD#6 into core.
This enables Shibboleth authentication (optionally) to allow login from academic institutions. This work should not concern itself with implementation of Shibboleth protocols - it will simply respond to authentication headers received from the web server and use those to create and configure users.
Related AM issue: artefactual/archivematica#666 (Shibboleth integration, merge from Jisc fork)
#216 fixed an issue when storing AIPs in an Arkivum space. However, that fix has introduced a new problem.
When storing an AIP in Arkivum, the storage service first moves the AIP from its internal storage location to the Arkivum AIP Storage Location with operating system commands (an rsync and a mv). Afterwards, a checksum is POSTed to the Arkivum REST API, so that Arkivum can verify the contents were copied successfully before continuing with its processing.
After #216, the Arkivum model looks up the checksum for compressed AIPs in the AIP's pointer file. For uncompressed AIPs the bag manifest file is sent instead.
Arkivum version 4.2 can process a range of checksum algorithms in bag manifest files, but for individual files (e.g., compressed AIPs) only md5 is supported.
The Arkivum model needs to be updated to send only md5 for compressed AIPs.
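Assuming the pointer file can record fixity values under several algorithms (an assumption made here for illustration; the function name is ours), the selection logic could look like:

```python
import hashlib


def checksum_for_arkivum(recorded_checksums, aip_path):
    """Arkivum accepts only md5 for single files: prefer an md5 already
    recorded in the pointer file, otherwise recompute it from the
    compressed AIP on disk."""
    if "md5" in recorded_checksums:
        return recorded_checksums["md5"]
    md5 = hashlib.md5()
    with open(aip_path, "rb") as f:
        # Stream in chunks so large AIPs do not need to fit in memory.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(chunk)
    return md5.hexdigest()
```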
We should be using tox.
After cloning the qa/0.8.x branch, without any local modification.
./manage.py check runs fine.
./manage.py runserver tells me: "You have unapplied migrations; your app may not work properly until they are applied."
If I run makemigrations, the following migration is created, which looks very minor to me:
class Migration(migrations.Migration):
dependencies = [
('locations', '0004_v0_7'),
]
operations = [
migrations.AlterField(
model_name='pipeline',
name='uuid',
field=django_extensions.db.fields.UUIDField(auto=False, validators=[django.core.validators.RegexValidator(b'\\w{8}-\\w{4}-\\w{4}-\\w{4}-\\w{12}', b'Needs to be format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx where x is a hexadecimal digit.', b'Invalid UUID')], help_text=b'Identifier for the Archivematica pipeline', unique=True, verbose_name=b'UUID'),
preserve_default=True,
),
]
Currently in osdeps/, RedHat-7.json is a symlink to CentOS-7.json, which is logical. However, this causes a problem with an Ansible playbook in the archivematica-src role: ss-osdeps.yml fails when doing a stat on that file (the one looked for if ansible_distribution = "RedHat") because 'isreg' (regular file rather than a symlink) returns false.
One obvious solution is to also make RedHat-7.json a regular file, at the cost of a slight increase in risk if it and CentOS-7.json are not kept in sync going forward.
Currently AIP deletions can only be done from the dashboard's Archival Storage tab. It is not possible to delete from the Storage Service directly (either via UI or API).
While in theory all AIPs are indexed and should appear in the Archival Storage, there are cases in which it would be useful to be able to delete an AIP not in Archival Storage.
A temporary workaround (to delete AIPs from the SS database that are not in Archival Storage) is available here
For example:
css/jquery.dataTables.css uses images/back_enabled.png
css/backbone-file-explorer.css uses img/folderopen.png
There are more.
The current implementation of the send_callback/post_store/ API endpoint only works with AIPs that include sha512 checksums.
Archivematica 1.6.0 and greater have a new feature allowing users to choose a different checksum algorithm. The callback should be updated to support any checksum algorithm supported by the BagIt spec.
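Per the BagIt spec, a bag declares its algorithms through its payload manifest filenames (manifest-md5.txt, manifest-sha256.txt, ...), so the callback could discover what is available rather than assuming sha512. A sketch (the helper name is ours):

```python
import os
import re

MANIFEST_RE = re.compile(r"^manifest-(\w+)\.txt$")


def bag_checksum_algorithms(bag_dir):
    """Return the checksum algorithms declared by a bag's payload
    manifests, e.g. {'sha256'} for a bag with manifest-sha256.txt."""
    algorithms = set()
    for name in os.listdir(bag_dir):
        match = MANIFEST_RE.match(name)
        if match:
            algorithms.add(match.group(1))
    return algorithms
```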
Merge work from JiscSD#1 and JiscSD#7
This creates a Dockerfile along with supporting fixes in the code. Storage service should be able to run either in or out of a Docker container.
Also JiscSD#4 and JiscSD#5 for configuring gunicorn from Dockerfile
Related: artefactual/archivematica#663
When you try to re-ingest a package that has already been re-ingested, the web interface shows the following message:
Error re-ingesting package: This AIP is already being reingested on {pipeline}
Instead of the literal {pipeline}, I would expect to see the name or the UUID of that pipeline.
Bagit-python 1.6 has been released, which includes fix LibraryOfCongress/bagit-python#63. This means the workaround for that issue can be removed.

Not able to download AIPs on the Archival Storage tab because they are not stored where they should be on
http://am17x.qa.archivematica.org/
If logged into SS on the am17x machine and you go to the packages tab, it shows that they're staged on the storage service, not uploaded. They shouldn't be available on the Archival Storage tab.
We use unar once in the code; lsar is used twice.
Can we achieve the same with 7-Zip?
It's best to keep the number of dependencies small: can we use 7-Zip where unar is used today?
Also, unar is not available in Alpine Linux, our preferred distribution when we build containers.
Instead, I think we can add a task that enables the operator to add a user as desired.
I have this issue:
I use a pipeline local filesystem to link the storage with the dashboard. In the dashboard, the services are run by the archivematica user.
For some reason, when the storage sends files to the dashboard, it uses another user, called archivematica-storage, to ssh into the dashboard.
Both archivematica and archivematica-storage are in the group archivematica, so the folders created by the storage on the dashboard look like this:
drwxrw-rw-. 1 archivematica-storage archivematica 0 Aug 9 15:26 archive
As you can see, the group can't execute the folder (so its members can't list what's inside). So, when the dashboard wants to use it via the archivematica user, it doesn't work.
To me, it makes sense that the created folders have the execute permission as long as they already have read and write, so I suggest the following PR: #227
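The idea of that suggestion, sketched (the helper name is ours, not the PR's code): wherever a directory already grants a class read permission, grant it execute too, so the class can traverse it. This mirrors what chmod calls the capital-X mode:

```python
import os
import stat


def add_execute_where_readable(path):
    """Give user/group/other execute permission on a directory wherever
    they already have read permission, so the directory is traversable."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & stat.S_IRUSR:
        mode |= stat.S_IXUSR
    if mode & stat.S_IRGRP:
        mode |= stat.S_IXGRP
    if mode & stat.S_IROTH:
        mode |= stat.S_IXOTH
    os.chmod(path, mode)
```

With the rw-rw-rw- folder above, this would yield rwxrwxrw- plus group execute, letting the dashboard's archivematica user descend into it.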
When a new pipeline is registered, SS guesses the IP address by looking at the REMOTE_ADDR header of the HTTP request. This mechanism has worked fine for a long time because the pipeline was either running in the same host or in a separate host with a static IP address. In more dynamic environments IP addresses may change often and SS eventually becomes unable to reach the pipeline back, for example, for re-ingest.
In JiscRDSS this issue was temporarily solved by adding a new environment variable, PIPELINE_REMOTE_NAME (defined in rdss-archivematica). This enables us to use DNS names instead, which works as long as the DNS server is up to date and the TCP port is the same across all the replicas.
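The resulting fallback order can be expressed directly (a sketch: the environment variable name comes from the JiscRDSS work, while resolve_remote_name and request_meta, a stand-in for Django's request.META, are ours):

```python
import os


def resolve_remote_name(request_meta, provided=None):
    """Pick the pipeline's remote name: an explicit API value first,
    then the PIPELINE_REMOTE_NAME environment variable, and finally
    the requester's address as seen by the web server."""
    return (provided
            or os.environ.get("PIPELINE_REMOTE_NAME")
            or request_meta.get("REMOTE_ADDR"))
```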
Recent commit 215cf18 brought in support for MySQL and PostgreSQL as database backends, in addition to SQLite.
When running tests on this repo, SQLite is still used, so it should not be necessary to install mysqlclient and other database-related dependencies in the test environment.
This is important for running automated tests (Travis) as well as for testing locally (with only the source code checked out, not on a fully deployed Archivematica server).
Sometimes new Pull Requests add code that requires a Django migration to be run. It would be very helpful to add a Travis check for this to all PRs.
This tool might do the job:
https://github.com/rev112/django-migration-checker
See: artefactual/archivematica#797 - same issue for Archivematica dashboard
In many situations it's desirable to use a distributed RDBMS like MySQL or PostgreSQL. There's probably not much stopping us from supporting them other than the limitations in our configuration system.
Solution
SQLite will stay as the default backend until a new major version of SS is released.
Option 1: add a new SS_DB_ENGINE environment variable. We would document which values are supported, e.g. django.db.backends.sqlite3 (default), django.db.backends.mysql, and so on.
DATABASES = {
'default': {
'ENGINE': get_env_variable('SS_DB_ENGINE'), # get_env_variable should take a default value
# ...
},
}
Option 2: look up a new environment variable SS_DB_URL and parse it using dj-database-url. When SS_DB_URL is used, the old environment strings (SS_DB_NAME, SS_DB_USER, SS_DB_PASSWORD and SS_DB_HOST) are ignored.
DATABASES = {
'default': dj_database_url.config(env='SS_DB_URL')
}
dj-database-url is very flexible. You can find many examples in test_dj_database_url.py, e.g. mysql://bea6eb025ca0d8:[email protected]/heroku_97681db3eff7580?reconnect=true.
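The core of what dj-database-url does can be shown with the standard library (a simplified sketch, not the package's actual code; the real library handles many more engines and URL options, and is shown here with Python 3's urllib for brevity):

```python
# Simplified sketch of dj-database-url's behavior: map a database URL
# onto Django's DATABASES connection dict.
from urllib.parse import urlparse

ENGINES = {
    "sqlite": "django.db.backends.sqlite3",
    "mysql": "django.db.backends.mysql",
    "postgres": "django.db.backends.postgresql_psycopg2",
}


def parse_db_url(url):
    """Turn a database URL into a Django-style connection dict."""
    parts = urlparse(url)
    return {
        "ENGINE": ENGINES[parts.scheme],
        "NAME": parts.path.lstrip("/"),
        "USER": parts.username or "",
        "PASSWORD": parts.password or "",
        "HOST": parts.hostname or "",
        "PORT": parts.port or "",
    }
```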
GPG is expecting to be able to write to a /home/archivematica directory which is not created by the SS's Dockerfile.
Attempt a metadata-only re-ingest (or other type) on an AIP in Archivematica at qa/1.x. The storeAIP micro-service will issue a PUT request against the SS's Package resources, attempting to update the AIP, which ultimately calls finish_reingest
. This breaks (returns a 500 response) because finish_reingest
is expecting a pointer file from AM which is no longer present because recent changes to Archivematica have removed the pointer file creation micro-service and pointer file creation has been moved to the Storage Service.
Tests should generate temporary data somewhere else.
How to reproduce:
$ docker run \
-e DJANGO_SETTINGS_MODULE=storage_service.settings.test \
--entrypoint pytest -t ss
administration/tests/test_languages.py ....
locations/tests/test_api.py ..............F..F...F.......
locations/tests/test_arkivum.py FFFFFF
locations/tests/test_dataverse.py ...F
locations/tests/test_dspace.py ..F..F.FF
locations/tests/test_duracloud.py .......FFFFFFFFFF
locations/tests/test_fixity_log.py .
locations/tests/test_locations.py ...
locations/tests/test_lockssomatic.py .
locations/tests/test_package.py ..........
locations/tests/test_swift.py .....FFFF.
storage_service/tests/test_shibboleth.py sssss
storage_service/tests/test_startup.py ....
Running as root fixes the problem but it's not ideal:
$ docker run \
--user=root \
-e DJANGO_SETTINGS_MODULE=storage_service.settings.test \
--entrypoint pytest -t ss
administration/tests/test_languages.py ....
locations/tests/test_api.py .............................
locations/tests/test_arkivum.py ......
locations/tests/test_dataverse.py ....
locations/tests/test_dspace.py .........
locations/tests/test_duracloud.py .................
locations/tests/test_fixity_log.py .
locations/tests/test_locations.py ...
locations/tests/test_lockssomatic.py .
locations/tests/test_package.py ..........
locations/tests/test_swift.py ..........
storage_service/tests/test_shibboleth.py sssss
storage_service/tests/test_startup.py ....
There are two files that make use of DJANGO_STATIC_ROOT:
install/.storage-service:
export DJANGO_STATIC_ROOT=/var/archivematica/storage-service/assets
install/storage-service.gunicorn-config.py:
"DJANGO_STATIC_ROOT=/usr/lib/archivematica/storage-service/assets",
But it's not looked up from our settings, i.e. storage_service/storage_service/settings/base.py reads:
STATIC_ROOT = normpath(join(SITE_ROOT, 'assets'))
I think we should update the files under install/ to stop using it.
The two fields on the Create Key page are labelled "Name real" and "Name email." It would be clearer if they were labelled "Name" and "Email."
Merge work from JiscSD#2
This adds a create_user management command to the storage service to help with automated deployment.
Related AM issue: artefactual/archivematica#665
Users want to be able to use SS's HTTP API to search across locations, packages, and files.
In testing the archivematica/1.x and archivematica-storage-service/0.x branches we encounter the following issue when a user authenticates with the external identity provider method Shibboleth:
I have a really weird behavior here... the storage is on a local pipeline setup like this:
/
/
path/to/storage/
The ingest fails to archive the AIP with this error:
locations.models.space:space:create_local_directory:482: Could not create storage directory: [Errno 13] Permission denied: '/c5fc'
From what I saw, the location path is not taken into account and it tries to put the AIP directly in the staging path of the space:
https://github.com/remileduc/archivematica-storage-service/blob/stable/0.10.x/storage_service/locations/models/space.py#L276
with destination_path coming from https://github.com/remileduc/archivematica-storage-service/blob/stable/0.10.x/storage_service/locations/models/package.py#L412
(from what I can understand, neither of them contains the location path).
So, I created a new space just for the AIP storage:
/
/path/to/storage-SPACE/
path/to/storage-LOCATION/
and it works, with this really weird log message:
locations.models.space:space:move_rsync:421: Moving from /path/to/storage-LOCATION/.../aip.7z to /path/to/storage-SPACE/.../aip.7z
This trick kind of works but in a really strange way... I don't understand the last move in the logs. There is something that doesn't make sense oO
Note: the line numbers in the logs may differ a bit from the ones in master, as I've added some lines to log more things.
Also, the logs are all from /var/log/archivematica/storage-service/storage_service.log
To re-create:
Create a GPG key using the SS interface
Delete that GPG key using the SS interface
Use the gpg command line tool to see if the key is still present. It should not be, but it is:
$ gpg --list-keys --homedir=/var/archivematica/storage_service/
Hello!
I think I spotted a bug in the communication between Islandora/Archidora and the Archivematica SWORD API.
I am getting an error, and the storage service complains when:
a) any field, such as dc:title or subtitle/dc:title, contains non-Latin characters, and/or
b) the filename of the deposited package also contains non-Latin characters (e.g. αβγπφ).
Here is the output of my log:
ERROR 2016-02-17 02:16:50 locations.api.sword.helpers:helpers:_fetch_content:239: Package download task encountered an error:'ascii' codec can't encode characters in position 26-32: ordinal not in range(128)
Regards,
Harry
Replication only works for AIP Storage locations, so there shouldn't be an option to choose replicators for any other type of location (unless/until that functionality is added).
The risk is a user mistakenly thinking the content from another type of location is being replicated when it's not.
The /api/v1/file/<uuid>/extract_file/ endpoint uses the following code to extract a compressed package:
(extracted_file_path, temp_dir) = package.extract_file(relative_path_to_file)
Then it runs the following to stream the contents of the file to the client:
response = http.FileResponse(open(filepath, 'rb'))
if temp_dir and os.path.exists(temp_dir):
shutil.rmtree(temp_dir)
return response
I guess this approach works because the interpreter holds the file descriptor open even after the files are deleted by shutil.rmtree, meaning that it can still read the contents of the file. Eventually, when the transfer completes, Django releases the file descriptor for us and the operating system can reclaim the space.
However, a client has reported a problem creating a new AIC, and the logs showed the following:
Traceback (most recent call last):
File "/usr/share/python/archivematica-storage-service/local/lib/python2.7/site-packages/tastypie/resources.py", line 220, in wrapper
response = callback(request, *args, **kwargs)
File "/usr/lib/archivematica/storage-service/locations/api/resources.py", line 91, in wrapper
result = func(resource, request, bundle, **kwargs)
File "/usr/lib/archivematica/storage-service/locations/api/resources.py", line 655, in extract_file_request
response = utils.download_file_stream(extracted_file_path, temp_dir)
File "/usr/lib/archivematica/storage-service/common/utils.py", line 124, in download_file_stream
shutil.rmtree(temp_dir)
File "/usr/lib/python2.7/shutil.py", line 247, in rmtree
rmtree(fullname, ignore_errors, onerror)
File "/usr/lib/python2.7/shutil.py", line 247, in rmtree
rmtree(fullname, ignore_errors, onerror)
File "/usr/lib/python2.7/shutil.py", line 256, in rmtree
onerror(os.rmdir, path, sys.exc_info())
File "/usr/lib/python2.7/shutil.py", line 254, in rmtree
os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/var/archivematica/storage_service/tmpGJtGYm/AIC_test_004-a2b1890f-c089-4d37-b96e-
39486f604445/data'
In this case, shutil.rmtree failed to delete the data sub-directory, but I don't know exactly how that can happen, because the code shouldn't try to remove a directory until it's empty (see https://github.com/python/cpython/blob/2.7/Lib/shutil.py).
I haven't been able to reproduce it. One workaround could be to pass ignore_errors=True to shutil.rmtree (best effort). Or maybe pass an onerror callback so at least we can log it.
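The logging variant could look like this (a sketch; the function names are ours):

```python
import logging
import shutil

LOGGER = logging.getLogger(__name__)


def log_rmtree_error(function, path, excinfo):
    """onerror callback for shutil.rmtree: record the failure and move on
    instead of raising, so the response can still be returned."""
    LOGGER.warning("Could not remove %s (during %s): %s",
                   path, function.__name__, excinfo[1])


def remove_tmp_dir(temp_dir):
    # Best-effort cleanup: failures are logged, never raised.
    shutil.rmtree(temp_dir, onerror=log_rmtree_error)
```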
When attempting to re-ingest an AIP (using any re-ingest type), the Django error "Error re-ingesting package: An unknown error occurred" shows up. (Originally filed under RM #11419, "Multi-process bagit.validate breaks AIP re-ingest", but this is a distinct issue.)
Inspection of the SS logs shows rsync failures when trying to move source files at paths like var/archivematica/sharedDirectory/www/AIPsStore/var/archivematica/storage_service/tmp4q3zkU/test1-09dc6190-1909-4cb9-8134-3081739b1f12/data. Such paths are impossible. The bad call is happening in move_to_storage_service, called from package.py::start_reingest.
Further investigation reveals that the call to self.extract_file() in start_reingest fails to set self.local_path_location to the SS-internal location. As a result, the current_location var in start_reingest never gets set to self.local_path_location, which means relative_path (cf. relative_path = local_path.replace(current_location.full_path, '', 1).lstrip('/')) fails to have the correct path prefix removed. The end result is that reingest_files contains a bunch of nonsense paths with relative_path prefixed to them.
The fix would seem to be restoring the following three lines at the end of package.py::extract_file:
if not relative_path:
self.local_path_location = ss_internal
self.local_path = output_path
These lines were removed by #224, although it is not clear to me whether their removal was essential to the goal of that PR.
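The path arithmetic that goes wrong can be seen in isolation (the paths below are invented for illustration; strip_location_prefix just mirrors the one-line relative_path computation quoted above):

```python
def strip_location_prefix(local_path, location_full_path):
    """Mirror of: local_path.replace(current_location.full_path, '', 1).lstrip('/')"""
    return local_path.replace(location_full_path, "", 1).lstrip("/")


# With the correct (SS-internal) location, the prefix is removed:
assert strip_location_prefix(
    "/ss-internal/tmpX/aip/data", "/ss-internal") == "tmpX/aip/data"

# With the wrong location, nothing is stripped; the near-absolute path
# survives and is later glued onto the AIP store path, producing the
# impossible nested paths seen in the logs:
assert strip_location_prefix(
    "/ss-internal/tmpX/aip/data", "/aipstore") == "ss-internal/tmpX/aip/data"
```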
After having tried to install 0.2.2ppa8.deb yesterday (before the fixes were pushed out), I attempted to apt-get purge the package and try it again today (ppa11). With some parts of the filesystem not left exactly as the script expects, it fails unnecessarily:
Setting up archivematica-storage-service (0.2.2ppa11) ...
creating archivematica user
User archivematica exists
creating django secret key
[redacted]
creating symlink in /usr/lib/archivematica
ln: failed to create symbolic link `/usr/lib/archivematica/storage-service/storage_service': File exists
mv: cannot move `/var/archivematica/storage-service/static' to `/usr/lib/archivematica/storage-service/static/static': Directory not empty
mv: cannot move `/var/archivematica/storage-service/templates' to `/usr/lib/archivematica/storage-service/templates/templates': Directory not empty
configuring django database and static files
/var/lib/dpkg/info/archivematica-storage-service.postinst: 33: /var/lib/dpkg/info/archivematica-storage-service.postinst: /usr/share/python/archivematica-storage-service/bin/python: not found
/var/lib/dpkg/info/archivematica-storage-service.postinst: 34: /var/lib/dpkg/info/archivematica-storage-service.postinst: /usr/share/python/archivematica-storage-service/bin/python: not found
updating directory permissions
rm: cannot remove `/tmp/storage-service.log': No such file or directory
dpkg: error processing archivematica-storage-service (--configure):
subprocess installed post-installation script returned error exit status 1
E: Sub-process /usr/bin/dpkg returned an error code (1)
This is a big issue when SS is doing I/O-bound work, because the workers block and wait instead of serving other requests.
I suggest using gevent (add gevent==1.2.1 to requirements) and running Gunicorn with --worker-class gevent. It has to be tested.
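If gevent is adopted, a minimal Gunicorn configuration could look like the sketch below (worker and connection counts are example values, not a recommendation; worker_class, workers, worker_connections and timeout are standard Gunicorn settings):

```python
# gunicorn.conf.py -- example only
worker_class = "gevent"    # cooperative workers: an I/O wait yields to other requests
workers = 2                # example value; tune to the host
worker_connections = 1000  # concurrent greenlets allowed per worker
timeout = 300              # generous, since storage operations can be slow
```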
In #174 we localized a bunch of field properties in our models but we did not include the corresponding migration.
Scenario: trying to get a single file from a 7-Zipped AIP.
When package.extract_file is called with a relative_path argument (i.e., extract only one file), and the package was compressed with 7-Zip, then 7z is used to extract that file from within the package.
If the file doesn't exist (there is no file inside the package at 'relative_path'), then 7z exits with a message that says:
No files to process
Files: 0
Size: 0
but it returns exit code 0.
The problem appears to be here:
https://github.com/artefactual/archivematica-storage-service/blob/stable/0.10.x/storage_service/locations/models/package.py#L538-L541
This code assumes exit code 0 means the file was extracted.
This was uncovered when trying to extract a manifest-sha512.txt file from an AIP that did not have that file; it had only a manifest-sha256.txt.
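Until 7z's output is parsed more carefully, a belt-and-braces fix is to trust the filesystem rather than the exit code: after running 7z, verify that the expected file actually exists. A sketch (the helper name is ours):

```python
import os


def verify_extraction(output_path):
    """7z can exit 0 even when it matched no files, so check for the
    extracted file instead of relying on the return code."""
    if not os.path.isfile(output_path):
        raise LookupError("No file was extracted to %s" % output_path)
    return output_path
```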