samvera / hydra-file_characterization
Samvera file characterization (extracted from Sufia)
License: Other
This is often the case where the file is an external datastream and we don't want to load the contents into memory.
Given I have a compressed file containing other files
When I call FileCharacterization with a path to the compressed file
Then I should get XML metadata for the compressed file and not its contained files
Rails version 6.0.0 was released on 08/16/2019, and in accordance with the charter of the current phase of the Component Maintenance Working Group, the CircleCI configuration for this should be updated.
Currently, if a multi-layer TIFF is characterized, FITS returns multiple imageHeight and imageWidth entries with status="CONFLICT". Nothing is done to check this status, though. The current behavior appears to be to just choose one or the other, which results in bad data.
e.g.
<imageWidth toolname="Jhove" toolversion="1.5" status="CONFLICT">2226</imageWidth>
<imageWidth toolname="Tika" toolversion="1.3" status="CONFLICT">160</imageWidth>
<imageHeight toolname="Jhove" toolversion="1.5" status="CONFLICT">1650</imageHeight>
<imageHeight toolname="Tika" toolversion="1.3" status="CONFLICT">119</imageHeight>
Sufia reports my object as having dimensions:
Height: 1650
Width: 160
I would prefer to use the largest of each value.
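The preference above could be sketched as follows. This is only an illustration, not the gem's current behavior: the helper name `largest_value` and the wrapping `<image>` element in the sample are assumptions made for the example, and it uses REXML rather than whatever parser the consuming application actually relies on.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of hydra-file_characterization): collect every
# element with the given local name and return the largest integer value.
def largest_value(xml, element_name)
  doc = REXML::Document.new(xml)
  # local-name() keeps the match working if the document declares a default namespace.
  nodes = REXML::XPath.match(doc, "//*[local-name()='#{element_name}']")
  nodes.map { |node| node.text.to_i }.max
end

# Sample built from the conflicting entries quoted above; the <image> wrapper
# is an assumption for the sake of a well-formed document.
FITS_SAMPLE = <<~XML
  <image>
    <imageWidth toolname="Jhove" toolversion="1.5" status="CONFLICT">2226</imageWidth>
    <imageWidth toolname="Tika" toolversion="1.3" status="CONFLICT">160</imageWidth>
    <imageHeight toolname="Jhove" toolversion="1.5" status="CONFLICT">1650</imageHeight>
    <imageHeight toolname="Tika" toolversion="1.3" status="CONFLICT">119</imageHeight>
  </image>
XML

width = largest_value(FITS_SAMPLE, 'imageWidth')
height = largest_value(FITS_SAMPLE, 'imageHeight')
puts "Width: #{width}, Height: #{height}"
```

With the sample above, both dimensions come from the same tool, which avoids the mixed Height/Width pair reported by Sufia. Taking the maximum is only one possible policy; preferring a single trusted tool would be another.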
All samvera projects should be using the same style checker
This repository’s default branch has already been renamed to main using GitHub’s renaming tool. In order to preserve automatic redirection of links that reference the old branch name master to the new default main branch, a branch with the old name should not be recreated.
CircleCI can be used to prevent the recreation of the old default branch name: a PR with a branch named master causes a test failure during continuous integration, which prevents it from being merged.
Git's default "master" branch derives from "master/slave" jargon which perpetuates systemic racist language and systems (see email Replacing "master" reference in git branch names). To uphold our Code of Conduct, we must move away from the term "master" in our technical language (as well as words like blacklist or whitelist).
Expected: if a PR is submitted with a branch named master, the continuous integration tests should fail.
Currently: if a PR is submitted with a branch named master, the continuous integration tests will not fail because of the branch name.
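One way such a check might look is a small script run as a CI step. This is a sketch under assumptions: the names `FORBIDDEN_BRANCHES` and `forbidden_branch?` are hypothetical, and it relies on the `CIRCLE_BRANCH` environment variable that CircleCI sets for a build. For PRs opened from forks that variable may not carry the source branch name, so this would not catch every case.

```ruby
# Hypothetical CI guard: fail the build when it runs on a forbidden branch name.
FORBIDDEN_BRANCHES = %w[master].freeze

# Returns true when the given branch name is one that must not be recreated.
def forbidden_branch?(branch_name)
  FORBIDDEN_BRANCHES.include?(branch_name.to_s.strip)
end

# When run directly as a CI step, exit non-zero so the build fails.
if __FILE__ == $PROGRAM_NAME && forbidden_branch?(ENV['CIRCLE_BRANCH'])
  abort "Branch '#{ENV['CIRCLE_BRANCH']}' is not allowed; please rename it."
end
```

Keeping the predicate separate from the exit call makes the rule easy to test without running a real build.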
Background on the renaming effort is available in the working group notes.
Ruby 2.7.0 was released on 12/25/2019, and in accordance with the charter of the current phase of the Component Maintenance Working Group, the CircleCI configuration for this should be updated.
I run this command:
Hydra::FileCharacterization.characterize(f, [:fits, :ffprobe])
This error is raised:
RuntimeError:
Unable to execute command "ffprobe -i "/Users/justin/workspace/bawstun/spec/storage/qj/72/pc/13/m/broadway_or_bust.pbcore.xml" -print_format xml -show_streams -v quiet"
It seems like there should be some conditions for which this characterizer is run.
Interestingly ffmpeg is still returning valid output:
<?xml version="1.0" encoding="UTF-8"?>
<ffprobe>
</ffprobe>
but it does warn that: "Invalid data found when processing input"
Perhaps characterizers need to be able to configure whether exit status must be success:
https://github.com/projecthydra/hydra-file_characterization/blob/c5b298f6cbbeac3a708badb43e355e0db40efe7a/lib/hydra/file_characterization/characterizer.rb#L45
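A minimal sketch of that idea, not the gem's actual implementation: the method name `run_tool` and the `require_success` option are hypothetical, and the example shells out to the Ruby interpreter itself so that it does not depend on ffprobe being installed.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical runner: executes a command and returns its standard output.
# When require_success is true (the assumed default), a non-zero exit status
# raises a RuntimeError; when false, the output is returned regardless.
def run_tool(*command, require_success: true)
  stdout, stderr, status = Open3.capture3(*command)
  if require_success && !status.success?
    raise "Unable to execute command \"#{command.join(' ')}\"\n#{stderr}"
  end
  stdout
end

# Stand-in for a tool that prints usable output but exits with a failure status.
failing_tool = [RbConfig.ruby, '-e', 'puts "<ffprobe></ffprobe>"; exit 1']

puts run_tool(*failing_tool, require_success: false)
```

A characterizer for ffprobe could then opt out of the strict check, while tools whose output is meaningless on failure keep the default.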
Derived from samvera/maintenance#16
Derived from samvera/maintenance#17
Given I have a File on the file system
And the File is confusing to characterize
When I call FileCharacterization with the path to the File
Then I raise a RuntimeError
Derived from samvera/maintenance#76
Using a service such as Coveralls, integrate code coverage analysis to ensure 100% coverage
It should look for fits on my PATH and shouldn't require:
Hydra::FileCharacterization.configure do |config|
  config.tool_path(:fits, '/path/to/fits')
end
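A PATH lookup along those lines could be sketched like this. The helper `find_on_path` is hypothetical and not part of the gem; it only shows how a default tool path might be discovered when none has been configured.

```ruby
# Hypothetical helper: return the full path of the first executable named
# `tool` found in the given PATH-style string, or nil when there is none.
def find_on_path(tool, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).reject(&:empty?).each do |dir|
    candidate = File.join(dir, tool)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Example lookup; prints nothing useful if the tool is not installed.
located = find_on_path('fits.sh')
puts located ? "Found FITS at #{located}" : 'FITS not found on PATH'
```

An explicit `config.tool_path(:fits, ...)` setting would presumably still take precedence, with the PATH lookup used only as a fallback.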
And it's also not great for A/V files. Since we've already got hydra-file_characterization, which wraps characterization tools, and the only benefit of using FITS is that it wraps characterization tools, I vote in favor of leaving FITS behind. Another downside of FITS is that the released versions often wrap outdated versions of the tools. Instead of FITS, we could directly wrap JHOVE(2), Exiftool, MediaInfo, file, DROID, and possibly others (FIDO?).
Right now they are still pointing to hydra/hydra-file_characterization.
Given I have a File on the file system
When I call FileCharacterization with the path to the File
Then I get a raw XML string
The XML output should contain the file mime type and file size at a minimum.
This repository’s default branch has already been renamed using GitHub’s renaming tool. Links that reference the old branch name are automatically forwarded to the new default branch. But string references are not automatically updated.
Check this repository for hard-coded string references to the legacy “master” default branch and update them to the new default branch name “main.”
Important places to check include, but are not limited to:
NOTE: READMEs, wikis, and other documentation are important to update to avoid confusion and correct errors in long lasting documentation.
Less common places to check:
NOTE: String references to the master branch in Issues, PRs, and code are uncommon. Also, Issues and PRs are temporal in nature, making it less critical to update those occurrences.
Git's default "master" branch derives from "master/slave" jargon which perpetuates systemic racist language and systems (see email Replacing "master" reference in git branch names). To uphold our Code of Conduct, we must move away from the term "master" in our technical language (as well as words like blacklist or whitelist).
Related to #18: FITS contains a script, fits-ngserver.sh, that starts a Nailgun server, which would save us having to spin up the JVM for every call to FITS. We'd probably get a good performance boost.
me@here:~ $ fits-ngserver.sh $NAILGUN_JAR &
[1] 9820
me@here:~ $ You may now run FITS by typing: ng edu.harvard.hul.ois.fits.Fits [options]
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar
NGServer started on all interfaces, port 2113.
me@here:~ $ time ng-nailgun edu.harvard.hul.ois.fits.Fits -i Desktop/pul-logo-tall-md-trans_0.png -o /tmp/sample.xml
real 0m1.854s // <------------------------------
user 0m0.000s
sys 0m0.000s
NGSession 1: 127.0.0.1: edu.harvard.hul.ois.fits.Fits exited with status 0
me@here:~ $ time ng-nailgun edu.harvard.hul.ois.fits.Fits -i Desktop/pul-logo-tall-md-trans_0.png -o /tmp/sample.xml
real 0m0.464s // <------------------------------
user 0m0.000s
sys 0m0.000s
NGSession 2: 127.0.0.1: edu.harvard.hul.ois.fits.Fits exited with status 0
This follows the proposed maintenance reorganization within samvera/maintenance#137
Add the following branch renaming language to the README for this repository.
## Contributing
If you're working on PR for this project, create a feature branch off of `main`.
This repository follows the [Samvera Community Code of Conduct](https://samvera.atlassian.net/wiki/spaces/samvera/pages/405212316/Code+of+Conduct) and [language recommendations](https://github.com/samvera/maintenance/blob/master/templates/CONTRIBUTING.md#language). Please ***do not*** create a branch called `master` for this repository or as part of your pull request; the branch will either need to be removed or renamed before it can be considered for inclusion in the code base and history of this repository.
Git's default "master" branch derives from "master/slave" jargon which perpetuates systemic racist language and systems (see email Replacing "master" reference in git branch names). To uphold our Code of Conduct, we must move away from the term "master" in our technical language (as well as words like blacklist or whitelist).
Derived from samvera/maintenance#89
The maintenance template for CONTRIBUTING.md is used by all Samvera repos. Replace the current version with the updated maintenance template version.
This is derived from samvera/maintenance#119, and follows the guide in https://guides.rubygems.org/mfa-requirement-opt-in/
Derived from samvera/maintenance#77
Related to #18: we found that 75% or more of the time to run FITS on our 100MB TIFF files was spent running Tesseract (run by Tika). We disabled Tika by commenting out the TikaTool line in the /path/to/fits/xml/fits.xml configuration file, and saw dramatically faster FITS execution times (20 seconds per file instead of 90+).
We updated our Ansible playbook to comment out the Tika line when we install FITS: ucsdlib/ansible-role-fits#2
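For readers not using Ansible, the same change could be sketched in Ruby. This is an illustration only: the helper `comment_out_tool` is hypothetical, and the sample configuration is a simplified stand-in rather than the real contents of fits.xml, which varies between FITS releases.

```ruby
# Hypothetical helper: wrap every line mentioning tool_name in an XML comment,
# leaving lines that are already commented out untouched.
def comment_out_tool(config_text, tool_name)
  config_text.lines.map do |line|
    stripped = line.strip
    if stripped.include?(tool_name) && !stripped.start_with?('<!--')
      indent = line[/\A\s*/]
      "#{indent}<!-- #{stripped} -->\n"
    else
      line
    end
  end.join
end

# Simplified, assumed excerpt of a tools section; not copied from a real fits.xml.
SAMPLE_CONFIG = <<~XML
  <tools>
    <tool class="edu.harvard.hul.ois.fits.tools.jhove.Jhove" />
    <tool class="edu.harvard.hul.ois.fits.tools.tika.TikaTool" />
  </tools>
XML

puts comment_out_tool(SAMPLE_CONFIG, 'TikaTool')
```

The function works on a string, so applying it to a real installation would mean reading the configuration file, passing its contents through, and writing the result back. Running it twice leaves the file unchanged because commented lines are skipped.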
This should match the template found within https://github.com/samvera/maintenance/tree/main/templates
Depends on #36
Gemnasium has been acquired by GitLab and has reached its end-of-life. Further, GitHub now freely provides security alerts: https://help.github.com/articles/about-security-alerts-for-vulnerable-dependencies/
Given that I have a File on the file system
When I call FileCharacterization with the path to a non-existent File
Then I raise a RuntimeError exception
The Renaming Branch Working Group is in the process of renaming the default branch from master to main in Samvera and Samvera-Labs repos. This brings repositories into compliance with the Samvera Community Code of Conduct (https://samvera.atlassian.net/wiki/spaces/samvera/pages/405212316/Code+of+Conduct) and language recommendations (https://github.com/samvera/maintenance/blob/master/templates/CONTRIBUTING.md#language).
This issue will be complete when the master branch has been renamed to main.
Related issues will have a title beginning with RENAME.