hydra-file_characterization

Code: Gem Version Build Status Coverage Status

Docs: Contribution Guidelines Apache 2.0 License

Community Support: Samvera Community Slack

What is hydra-file_characterization?

Provides a wrapper for file characterization.

Supported versions

This software is currently tested against:

  • FITS 1.4.1
  • Ruby 2.6, 2.7, and 3.0
  • Rails 6.0, 6.1, and 7.0

Product Owner & Maintenance

hydra-file_characterization was a Core Component of the Samvera Community. Given a decline in available labor required for maintenance, this project no longer has a dedicated Product Owner. The documentation for what this means can be found here.

Product Owner

Vacant

Until a Product Owner has been identified, we ask that you please direct all requests for support, bug reports, and general questions to the #dev Channel on the Samvera Slack.

Help

The Samvera community is here to help. Please see our support guide.

Getting Started

If you are using Rails, add the following to an initializer (./config/initializers/hydra-file_characterization_config.rb):

Hydra::FileCharacterization.configure do |config|
  config.tool_path(:fits, '/path/to/fits')
end
Hydra::FileCharacterization.characterize(File.read(filename), File.basename(filename), :fits)
  • Why File.read? To highlight that we want a string. In the case of ActiveFedora, we have a StringIO instead of a file.
  • Why File.basename? In the case of FITS, the characterization takes cues from the file extension.
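
The two points above can be illustrated with a small sketch. It is not the gem's implementation: `with_named_tempfile` is a hypothetical helper that only shows why string content plus a basename is enough to give a tool a file on disk with the expected extension.

```ruby
require 'stringio'
require 'tempfile'

# Hypothetical helper (not part of the gem): writes string content to a
# temporary file whose name keeps the extension of +filename+, then yields
# the temporary file's path to the block and returns the block's value.
def with_named_tempfile(content, filename)
  extension = File.extname(filename)
  basename = File.basename(filename, extension)
  Tempfile.create([basename, extension]) do |file|
    file.binmode
    file.write(content)
    file.flush
    yield file.path
  end
end

# Content may come from a StringIO rather than from a file on disk.
content = StringIO.new('puts "hello"').read
extension_seen = with_named_tempfile(content, 'file.rb') { |path| File.extname(path) }
```

Because the temporary file keeps the `.rb` suffix, a tool that inspects the extension still sees it even though the content arrived as a string.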

You can call a single characterizer...

xml_string = Hydra::FileCharacterization.characterize(File.read("/path/to/my/file.rb"), 'file.rb', :fits)

...for this particular call, you can specify a custom FITS path...

xml_string = Hydra::FileCharacterization.characterize(contents_of_a_file, 'file.rb', :fits) do |config|
  config[:fits] = './really/custom/path/to/fits'
end

...or even make the path callable...

xml_string = Hydra::FileCharacterization.characterize(contents_of_a_file, 'file.rb', :fits) do |config|
  config[:fits] = lambda {|filename|  }
end

...or even create your custom characterizer on the file...

xml_string = Hydra::FileCharacterization.characterize(contents_of_a_file, 'file.rb', :my_characterizer) do |config|
  config[:my_characterizer] = lambda {|filename|  }
end
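
The lambdas above are left empty. As a hedged illustration, assuming the callable receives the path of the file being characterized and returns the characterization output as a string, a custom characterizer might look like the sketch below. The name `size_characterizer` and its output format are hypothetical.

```ruby
require 'tempfile'

# Hypothetical characterizer: given a path, returns a tiny XML string
# reporting the file's extension and its size in bytes.
size_characterizer = lambda do |filename|
  extension = File.extname(filename).delete_prefix('.')
  "<file extension=\"#{extension}\" bytes=\"#{File.size(filename)}\"/>"
end

# Exercise the lambda directly against a temporary file.
sample_output = Tempfile.create(['sample', '.rb']) do |file|
  file.write('puts 1')
  file.flush
  size_characterizer.call(file.path)
end
```

Under that assumption, the lambda could be assigned with `config[:my_characterizer] = size_characterizer` inside the block form of `characterize`.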

You can also call multiple characterizers at the same time.

fits_xml, ffprobe_xml = Hydra::FileCharacterization.characterize(contents_of_a_file, 'file.rb', :fits, :ffprobe)

Registering New Characterizers

This is possible by adding a characterizer to the Hydra::FileCharacterization::Characterizers namespace.

Contributing

Running the tests:

  • Install FITS v1.4.1, which is the most recent version we've tested against.
mkdir ~/fits
wget "https://github.com/harvard-lts/fits/releases/download/1.4.1/fits-1.4.1.zip"
unzip -d ~/fits/ "fits-1.4.1.zip"
chmod a+x ~/fits/fits.sh
ln -s ~/fits/fits.sh ~/fits/fits
rm "fits-1.4.1.zip"
  • Once FITS is installed, you should be able to run the tests with: rspec spec

If you're working on a PR for this project, create a feature branch off of main.

This repository follows the Samvera Community Code of Conduct and language recommendations. Please do not create a branch called master for this repository or as part of your pull request; the branch will either need to be removed or renamed before it can be considered for inclusion in the code base and history of this repository.

Releasing

  1. bundle install
  2. Increase the version number in lib/hydra/file_characterization/version.rb
  3. Increase the same version number in .github_changelog_generator
  4. Update CHANGELOG.md by running this command:
github_changelog_generator --user samvera --project hydra-file_characterization --token YOUR_GITHUB_TOKEN_HERE
  5. Commit these changes to the main branch
  6. Run rake release

Acknowledgments

This software has been developed by and is brought to you by the Samvera community. Learn more at the Samvera website.

hydra-file_characterization's People

Contributors

bess, botimer, carolyncole, cbeer, cjcolvar, elrayle, fnibbit, grosscol, jcoyne, jeremyf, jrgriffiniii, little9, mlooney, rbalekai, spr7b, tpendragon

hydra-file_characterization's Issues

Largest width / height should be used

Currently, if a multi-layer TIFF is characterized, FITS returns multiple imageHeight and imageWidth entries with status="CONFLICT". Nothing is done to check this status, though. It looks like the current behavior is to just choose one or the other. This results in bad data.

e.g.

      <imageWidth toolname="Jhove" toolversion="1.5" status="CONFLICT">2226</imageWidth>
      <imageWidth toolname="Tika" toolversion="1.3" status="CONFLICT">160</imageWidth>
      <imageHeight toolname="Jhove" toolversion="1.5" status="CONFLICT">1650</imageHeight>
      <imageHeight toolname="Tika" toolversion="1.3" status="CONFLICT">119</imageHeight>

Sufia reports my object as having dimensions:

Height: 1650
Width: 160

I would prefer to use the largest of each value.
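
A minimal sketch of that preference, using REXML on a fragment without an XML namespace. `largest_value` is a hypothetical helper, and the wrapping `<image>` element is added only to make the sample well-formed. Real FITS output declares a default namespace, so the XPath would need adjusting.

```ruby
require 'rexml/document'

# Hypothetical helper: returns the largest integer value among all elements
# named +element_name+ in the given XML, or nil when none are present.
def largest_value(fits_xml, element_name)
  document = REXML::Document.new(fits_xml)
  values = REXML::XPath.match(document, "//#{element_name}").map { |node| node.text.to_i }
  values.max
end

# The conflicting entries from the example above, wrapped in a root element.
fits_fragment = <<~XML
  <image>
    <imageWidth toolname="Jhove" toolversion="1.5" status="CONFLICT">2226</imageWidth>
    <imageWidth toolname="Tika" toolversion="1.3" status="CONFLICT">160</imageWidth>
    <imageHeight toolname="Jhove" toolversion="1.5" status="CONFLICT">1650</imageHeight>
    <imageHeight toolname="Tika" toolversion="1.3" status="CONFLICT">119</imageHeight>
  </image>
XML

width = largest_value(fits_fragment, 'imageWidth')
height = largest_value(fits_fragment, 'imageHeight')
```

With the sample values this selects 2226 for the width and 1650 for the height, both from the same tool.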

Add support for Rails 6.0.z releases

Rails version 6.0.0 was released on 08/16/2019, and in accordance with the charter of the current phase of the Component Maintenance Working Group, the CircleCI configuration for this should be updated.

Convention over configuration.

It should look for fits on my PATH and shouldn't require:

Hydra::FileCharacterization.configure do |config|
  config.tool_path(:fits, '/path/to/fits')
end
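
One way the requested lookup could work is sketched below. `find_on_path` is a hypothetical helper, not something the gem provides, and the fallback path is the placeholder from the configuration above.

```ruby
# Hypothetical helper: returns the full path of the first executable named
# +tool+ found in the directories of +path_env+, or nil when there is none.
def find_on_path(tool, path_env = ENV.fetch('PATH', ''))
  path_env.split(File::PATH_SEPARATOR).each do |directory|
    candidate = File.join(directory, tool)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Fall back to an explicitly configured location when nothing is on PATH.
fits_path = find_on_path('fits') || '/path/to/fits'
```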

FITS is god-awfully slow. Let's replace it.

And it's also not great for A/V files. Since we've already got hydra-file_characterization, which wraps characterization tools, and the only benefit of using FITS is that it wraps characterization tools, I vote in favor of leaving FITS behind. Another downside of FITS is that the released versions often wrap outdated versions of the tools. Instead of FITS, we could directly wrap JHOVE(2), Exiftool, MediaInfo, file, DROID, and possibly others (FIDO?).

FITS with Nailgun?

Related to #18: FITS contains a script, fits-ngserver.sh, that starts a Nailgun server, which would save us from having to spin up the JVM for every call to FITS. We'd probably get a good performance boost.

me@here:~ $ fits-ngserver.sh $NAILGUN_JAR &
[1] 9820
me@here:~ $ You may now run FITS by typing: ng edu.harvard.hul.ois.fits.Fits [options]
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar 
NGServer started on all interfaces, port 2113.

me@here:~ $ time ng-nailgun edu.harvard.hul.ois.fits.Fits -i Desktop/pul-logo-tall-md-trans_0.png -o /tmp/sample.xml

real    0m1.854s // <------------------------------
user    0m0.000s
sys     0m0.000s
NGSession 1: 127.0.0.1: edu.harvard.hul.ois.fits.Fits exited with status 0
me@here:~ $ time ng-nailgun edu.harvard.hul.ois.fits.Fits -i Desktop/pul-logo-tall-md-trans_0.png -o /tmp/sample.xml

real    0m0.464s // <------------------------------
user    0m0.000s
sys     0m0.000s
NGSession 2: 127.0.0.1: edu.harvard.hul.ois.fits.Fits exited with status 0

FITS runs Tika, which runs Tesseract, which is very slow

Related to #18 — we found that 75% or more of the time to run FITS on our 100MB TIFF files was spent running Tesseract (run by Tika). We disabled Tika by commenting out the TikaTool line in the /path/to/fits/xml/fits.xml configuration file, and saw dramatically faster FITS execution times (20 seconds per file instead of 90+).
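
The manual edit described above could be scripted roughly as follows. This is a sketch under assumptions: `disable_tika` is a hypothetical helper, and the sample configuration is illustrative, so the real fits.xml may be laid out differently.

```ruby
# Hypothetical helper: wraps every uncommented line mentioning TikaTool in an
# XML comment and returns the modified configuration text.
def disable_tika(config_text)
  config_text.lines.map do |line|
    if line.include?('TikaTool') && !line.include?('<!--')
      indent = line[/\A\s*/]
      "#{indent}<!-- #{line.strip} -->\n"
    else
      line
    end
  end.join
end

# Illustrative configuration fragment; the real fits.xml may differ.
sample_config = <<~XML
  <tools>
    <tool class="edu.harvard.hul.ois.fits.tools.jhove.Jhove"/>
    <tool class="edu.harvard.hul.ois.fits.tools.tika.TikaTool"/>
  </tools>
XML

updated_config = disable_tika(sample_config)
```

Lines that are already commented out are left alone, so running the helper twice produces the same text.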

We updated our Ansible playbook to comment out the Tika line when we install FITS: ucsdlib/ansible-role-fits#2

RENAME master branch to main

The Renaming Branch Working Group is in the process of renaming the default branch from master to main in Samvera and Samvera-Labs repos. This brings repositories into compliance with the Samvera Community Code of Conduct (https://samvera.atlassian.net/wiki/spaces/samvera/pages/405212316/Code+of+Conduct) and language recommendations (https://github.com/samvera/maintenance/blob/master/templates/CONTRIBUTING.md#language).

This issue will be complete when the master has been renamed to main.

Related issues will have a title beginning with RENAME.

RENAME: Add Circle CI step that fails if branch name is master

Descriptive summary

This repository’s default branch has already been renamed to main using GitHub’s renaming tool. In order to preserve automatic redirection of links that reference the old branch name master to the new default main branch, a branch with the old name should not be recreated.

CircleCI can be used to prevent the recreation of the old default branch name by preventing PRs with a branch named master from being merged by causing a test failure during continuous integration.
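
A check of this kind could be a short script run as a CI step. The sketch below assumes the step reads the branch name from CircleCI's built-in `CIRCLE_BRANCH` environment variable. The `branch_allowed?` helper and the `FORBIDDEN_BRANCHES` constant are hypothetical names.

```ruby
# Branch names that must not be recreated.
FORBIDDEN_BRANCHES = %w[master].freeze

# Returns true when the branch name is acceptable.
def branch_allowed?(branch_name)
  !FORBIDDEN_BRANCHES.include?(branch_name.to_s.strip)
end

# CIRCLE_BRANCH is set by CircleCI to the branch being built.
branch = ENV.fetch('CIRCLE_BRANCH', '')
unless branch_allowed?(branch)
  warn "Branch '#{branch}' is not allowed; please rename it."
  exit 1
end
```

A non-zero exit status fails the CI step, which in turn blocks the pull request from passing its checks.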

Rationale

Git's default "master" branch derives from "master/slave" jargon which perpetuates systemic racist language and systems (see email Replacing "master" reference in git branch names). To uphold our Code of Conduct, we must move away from the term "master" in our technical language (as well as words like blacklist or whitelist).

Expected behavior

If a PR is submitted with a branch named master, the continuous integration tests should fail.

Actual behavior

If a PR is submitted with a branch named master, the continuous integration tests will not fail because of the branch name.

Related work

Background on the renaming effort is available in the working group notes.

RENAME: Update references of hard-coded legacy master branch name to main branch name

Descriptive summary

This repository’s default branch has already been renamed using GitHub’s renaming tool. Links that reference the old branch name are automatically forwarded to the new default branch. But string references are not automatically updated.

Check this repository for hard-coded string references to the legacy “master” default branch and update them to the new default branch name “main.”

Important places to check include, but are not limited to:

  • READMEs
  • wikis
  • other documentation

NOTE: READMEs, wikis, and other documentation are important to update to avoid confusion and correct errors in long lasting documentation.

Less common places to check:

  • code
  • Issues/PRs

NOTE: String references to the master branch in Issues, PRs, and code are uncommon. Also, Issues and PRs are temporal in nature, making it less critical to update those occurrences.

Rationale

Git's default "master" branch derives from "master/slave" jargon which perpetuates systemic racist language and systems (see email Replacing "master" reference in git branch names). To uphold our Code of Conduct, we must move away from the term "master" in our technical language (as well as words like blacklist or whitelist).

FFmpeg, when called on a non-video file raises an error

I run this command:

      Hydra::FileCharacterization.characterize(f, [:fits, :ffprobe])

This error is raised:

 RuntimeError:
       Unable to execute command "ffprobe -i "/Users/justin/workspace/bawstun/spec/storage/qj/72/pc/13/m/broadway_or_bust.pbcore.xml" -print_format xml -show_streams -v quiet"

It seems like there should be some conditions for which this characterizer is run.

Interestingly ffmpeg is still returning valid output:

<?xml version="1.0" encoding="UTF-8"?>
<ffprobe>
</ffprobe>

but it does warn that: "Invalid data found when processing input"

Perhaps characterizers need to be able to configure whether exit status must be success:
https://github.com/projecthydra/hydra-file_characterization/blob/c5b298f6cbbeac3a708badb43e355e0db40efe7a/lib/hydra/file_characterization/characterizer.rb#L45
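
The idea of making the exit-status check configurable can be sketched with Open3. `run_command` and its `require_success` keyword are hypothetical and do not reflect the gem's current API. A Ruby one-liner stands in for a tool that prints output and then exits with a failure status.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: runs +command+ (an array of arguments) and returns its
# standard output. A non-zero exit status raises unless require_success is false.
def run_command(command, require_success: true)
  stdout, stderr, status = Open3.capture3(*command)
  if require_success && !status.success?
    raise "Unable to execute command #{command.join(' ').inspect}: #{stderr}"
  end
  stdout
end

# Stand-in for a tool that prints output but exits with a failure status.
failing_tool = [RbConfig.ruby, '-e', 'puts "<ffprobe></ffprobe>"; exit 1']
lenient_output = run_command(failing_tool, require_success: false)
```

With `require_success: false`, the caller still receives whatever the tool printed and can decide whether an empty result is acceptable.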

Add support for Ruby 2.7.z releases

Ruby 2.7.0 was released on 12/25/2019, and in accordance with the charter of the current phase of the Component Maintenance Working Group, the CircleCI configuration for this should be updated.

RENAME: Add language to README about branch naming

Add the following branch renaming language to the README for this repository.

## Contributing 

If you're working on PR for this project, create a feature branch off of `main`. 

This repository follows the [Samvera Community Code of Conduct](https://samvera.atlassian.net/wiki/spaces/samvera/pages/405212316/Code+of+Conduct) and [language recommendations](https://github.com/samvera/maintenance/blob/master/templates/CONTRIBUTING.md#language).  Please ***do not*** create a branch called `master` for this repository or as part of your pull request; the branch will either need to be removed or renamed before it can be considered for inclusion in the code base and history of this repository.

Rationale

Git's default "master" branch derives from "master/slave" jargon which perpetuates systemic racist language and systems (see email Replacing "master" reference in git branch names). To uphold our Code of Conduct, we must move away from the term "master" in our technical language (as well as words like blacklist or whitelist).
