
bagging's People

Contributors

bhimak, dependabot-preview[bot], dmoles, jbleduigou, jscancella, kcclaas, renovate[bot], sprater, volkerhartmann

bagging's Issues

In place bag creation

@jscancella Thanks for maintaining this library!

I was just working on using it to create some bags. The problem I encountered is that it is designed to create a bag by making copies of all of the payload files. For my use case, I do not want this duplication.

I browsed the code, and it looks like it would be fairly easy to support either creating bags in place or optionally allowing moves instead of copies.

For creating bags in place, I believe the only modification that would be needed is a check to see if the entry and newEntry physical locations are the same (here) and do nothing if they are. However, I see that you have a warning here that would fire if you did try to create the bag in place. The warning text suggests that it won't create the bag, but it doesn't actually appear to short circuit.

If you don't like the idea of creating a bag in place, then perhaps moves could be supported? You could add an option, either at the builder level or at the payload file level, to indicate whether files should be moved or copied. A slightly larger change, but perhaps more in line with how you intend the builder to be used?
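
A minimal sketch of the two ideas above, assuming a hypothetical helper (`PayloadPlacer`, `Mode`, and `place` are illustrative names, not part of bagging). It skips the transfer when source and target are already the same file, and otherwise moves or copies depending on an option:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration; not part of the bagging API. */
final class PayloadPlacer {
    enum Mode { COPY, MOVE }

    /**
     * Places source at target. Returns false and does nothing when both
     * paths already refer to the same file (the in-place case).
     */
    static boolean place(Path source, Path target, Mode mode) throws IOException {
        // isSameFile can throw if target is missing, so check existence first
        if (Files.exists(target) && Files.isSameFile(source, target)) {
            return false;
        }
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        if (mode == Mode.MOVE) {
            Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
        } else {
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return true;
    }
}
```

Whether the real builder should short-circuit silently or keep the existing warning is a design decision for the maintainer; the sketch only shows that the check itself is small.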

If you are amenable to either of these ideas, I'd be happy to send you a PR, if you'd like.

BagBuilder.addPayloadFile() does not construct paths within the bag's data directory when the specified path is a directory

Example:

var src = Files.createDirectories(Paths.get("/var/tmp/foo"));
var dst = Paths.get("/var/tmp/foo-bag");

Files.writeString(src.resolve("file.txt"), "bar");

new BagBuilder()
        .addAlgorithm("md5")
        .addPayloadFile(src)
        .bagLocation(dst)
        .write();

try (var walk = Files.walk(dst)) {
   walk.forEach(System.out::println);
}

Expected output:

/var/tmp/foo-bag
/var/tmp/foo-bag/data/foo
/var/tmp/foo-bag/data/foo/file.txt
/var/tmp/foo-bag/manifest-md5.txt
/var/tmp/foo-bag/bagit.txt
/var/tmp/foo-bag/tagmanifest-md5.txt

Actual output:

/var/tmp/foo-bag
/var/tmp/foo-bag/foo
/var/tmp/foo-bag/foo/file.txt
/var/tmp/foo-bag/manifest-md5.txt
/var/tmp/foo-bag/bagit.txt
/var/tmp/foo-bag/tagmanifest-md5.txt

The issue appears to be that ManifestBuilderVistor does not use the relative path when compiling ManifestEntries.
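
A rough sketch of the path computation the expected output implies, assuming the directory's own name should be kept under `data/`. `PayloadPaths` and `insideBag` are hypothetical names used only for illustration, not the library's visitor:

```java
import java.nio.file.Path;

/** Hypothetical helper for illustration; not the library's visitor. */
final class PayloadPaths {
    /**
     * Maps a file found while walking payloadRoot to its location inside
     * the bag, keeping the payload directory's own name under data/.
     */
    static Path insideBag(Path bagRoot, Path payloadRoot, Path file) {
        Path base = payloadRoot.getParent();
        if (base == null) {
            throw new IllegalArgumentException("payloadRoot needs a parent: " + payloadRoot);
        }
        Path relative = base.relativize(file); // e.g. foo/file.txt
        return bagRoot.resolve("data").resolve(relative);
    }
}
```

With the paths from the example, `insideBag(dst, src, src.resolve("file.txt"))` yields `/var/tmp/foo-bag/data/foo/file.txt`.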

Convert domain objects to be completely immutable

Currently the domain objects allow for some changes after creation. Create a pull request for review that changes the domain objects (like Bag) to be completely immutable and throw modification error when trying to modify them.
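
One possible shape for such an object, as a sketch only: `ImmutableManifest` is a hypothetical stand-in, not the library's class. It takes a defensive copy in the constructor, and the returned map rejects modification with `UnsupportedOperationException`:

```java
import java.nio.file.Path;
import java.util.Map;

/** Hypothetical stand-in for a domain object; not the library's Bag or Manifest. */
final class ImmutableManifest {
    private final String algorithm;
    private final Map<Path, String> checksums;

    ImmutableManifest(String algorithm, Map<Path, String> checksums) {
        this.algorithm = algorithm;
        // Map.copyOf makes an unmodifiable copy, so later changes to the
        // caller's map do not leak into this object.
        this.checksums = Map.copyOf(checksums);
    }

    String getAlgorithm() {
        return algorithm;
    }

    /** The returned map throws UnsupportedOperationException on put/remove. */
    Map<Path, String> getChecksums() {
        return checksums;
    }
}
```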

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Detected dependencies

circleci
.circleci/config.yml
  • cimg/openjdk 17.0.11
  • cimg/openjdk 21.0
gradle
code-quality.gradle
  • pmd 6.55.0
eclipse.gradle
maven-central.gradle
message-bundle.gradle
settings.gradle
build.gradle
  • com.github.nbaztec.coveralls-jacoco 1.2.20
  • de.aaschmid.cpd 3.4
  • org.ajoberstar.grgit 5.2.2
  • com.github.spotbugs 6.0.20
  • info.solidsoft.pitest 1.15.0
  • com.dorongold.task-tree 2.1.1
  • org.slf4j:slf4j-api 2.0.16
  • com.fasterxml.jackson.core:jackson-core 2.17.2
  • com.fasterxml.jackson.core:jackson-databind 2.17.2
  • org.junit.jupiter:junit-jupiter 5.11.0
  • org.springframework.boot:spring-boot-starter-logging 3.3.2
  • org.bouncycastle:bcprov-jdk15on 1.70
  • org.kamranzafar:jtar 2.3
gradle-wrapper
gradle/wrapper/gradle-wrapper.properties
  • gradle 8.10

  • Check this box to trigger a request for Renovate to run again on this repository

Saving extra tag files in directory (not in bag itself!)

When submitting an issue please include:

  • a small code example showing the incorrect behavior
    BagBuilder builder = new BagBuilder();
    builder.addAlgorithm("md5")
    .addMetadata("foo", "bar")
    .addPayloadFile(payloadFiles)
    .addTagFile(Paths.get(absolute_dir + "/import/meta"))
    .bagLocation(Paths.get(absolute_dir + "/export/bag"))
    .write();

  • the expected behavior
    Structure of Bag MUST be
    data
    meta (with files for example: mods.xml, rights.xml)
    bag-info.txt
    bagit.txt
    manifest-md5.txt
    tagmanifest-md5.txt (Must include: xxxxxxxxxxxxxxxxxxxxxxxx meta/mods.xml)
    (Must include: xxxxxxxxxxxxxxxxxxxxxxxx meta/rights.xml)

  • the actual behavior
    Structure of Bag INCORRECT
    data
    mods.xml
    rights.xml
    bag-info.txt
    bagit.txt
    manifest-md5.txt
    tagmanifest-md5.txt

  • the operating system being used, and its version
    debian 10

  • version of Bagging being used
    4.2

Fix Badges

Several of the badges listed in the README.md file are missing or displaying incorrect information (for example, the Javadoc badge does not actually display any Javadocs). Investigate why and submit a pull request to fix them.

update custom PMD ruleset

The current PMD ruleset has many exceptions. We should look over them and see what ones should really be excepted and document why (comments in the xml?).

For example, we probably don't want an exception for MethodArgumentCouldBeFinal, but we do want an exception for DataflowAnomalyAnalysis since it has a bug in it that doesn't recognize variables created in for-each loops.

Add Builder Pattern to Bag

Several people have requested that the ability to add or remove files to/from a bag easily be added. Currently the programmer has to access the Bag's internal map of checksums and add the file to them, calculating the checksum, and then write it to disk.

Investigate adding the Builder pattern (i.e. Bag.add(FILE1).add(FILE2).build()) to create a new Bag object that can then be written to disk.
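
A sketch of what this could look like, assuming a simplified stand-in class. `SketchBag`, `toBuilder`, `add`, and `remove` are illustrative names, and checksum calculation is left out entirely:

```java
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for Bag; it only tracks payload paths. */
final class SketchBag {
    private final Set<Path> payload;

    private SketchBag(Set<Path> payload) {
        this.payload = Set.copyOf(payload); // unmodifiable snapshot
    }

    Set<Path> getPayload() {
        return payload;
    }

    static Builder builder() {
        return new Builder(new LinkedHashSet<>());
    }

    /** Starts a builder pre-filled with this bag's payload. */
    Builder toBuilder() {
        return new Builder(new LinkedHashSet<>(payload));
    }

    static final class Builder {
        private final Set<Path> payload;

        private Builder(Set<Path> payload) {
            this.payload = payload;
        }

        Builder add(Path file) {
            payload.add(file);
            return this;
        }

        Builder remove(Path file) {
            payload.remove(file);
            return this;
        }

        /** Produces a new bag; the original bag is never modified. */
        SketchBag build() {
            return new SketchBag(payload);
        }
    }
}
```

Usage would read as `bag.toBuilder().add(file1).remove(file2).build()`, leaving the original object untouched.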

Action Required: Fix Renovate Configuration

There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.

Location: renovate.json
Error type: The renovate configuration file contains some invalid settings
Message: Invalid configuration option: packageRules[0].matchDatasource

Javadoc validation fails during build

After cloning the repository into a fresh local repo, I attempt to build the project while on the master branch:

./gradlew clean check

The following Javadoc generation errors occur:

> Task :javadoc
C:\Projects\bagging\src\main\java\com\github\jscancella\conformance\internal\LargeBagChecker.java:35: error: invalid end tag: </br>
   * Check if a bag is "large", which is: </br>
                                          ^
C:\Projects\bagging\src\main\java\com\github\jscancella\verify\internal\BagitTextFileVerifier.java:23: warning: no description for @param
   * @param bag
     ^
C:\Projects\bagging\src\main\java\com\github\jscancella\verify\internal\BagitTextFileVerifier.java:24: warning: no description for @throws
   * @throws IOException
     ^
1 error
2 warnings

> Task :javadoc FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':javadoc'.
> Javadoc generation failed. Generated Javadoc options file (useful for troubleshooting): 'C:\Projects\bagging\build\tmp\javadoc\javadoc.options'

OS: Windows 10 Enterprise (64 bit)
Java:

openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode)

Gradle:

------------------------------------------------------------
Gradle 6.0.1
------------------------------------------------------------

Build time:   2019-11-18 20:25:01 UTC
Revision:     fad121066a68c4701acd362daf4287a7c309a0f5

Kotlin:       1.3.50
Groovy:       2.5.8
Ant:          Apache Ant(TM) version 1.10.7 compiled on September 1 2019
JVM:          1.8.0_212 ( 25.212-b03)
OS:           Windows 10 10.0 amd64

Feature: progress status for validation

#Current behavior (if applicable)
When calling bag.isComplete(ignoreHiddenFiles), the system gives no feedback on where it is in the list of files to check.

#Proposed behavior
When calling bag.isComplete(ignoreHiddenFiles) the system updates an object when it has completed checking a file with either success or failure.

#Why this feature is useful
This allows GUI tools to display a progress bar for the user. It would also be nice to be able to calculate and guess the time remaining but that might not be possible

#A small code example if possible

boolean ignoreHiddenFiles = true;
bag.isComplete(ignoreHiddenFiles, ProgressTrackerObject);

What kind of interface the ProgressTrackerObject should have would depend on what GUI applications use (look into JavaFX and Swing).
See https://docs.oracle.com/javase/tutorial/uiswing/components/progress.html, https://docs.oracle.com/javase/8/docs/api/javax/swing/JProgressBar.html
and https://docs.oracle.com/javase/8/javafx/api/javafx/scene/control/ProgressBar.html
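
One toolkit-neutral possibility, sketched under assumptions: a plain callback interface that Swing or JavaFX code could adapt to its own progress bar. `ProgressListener`, `fileChecked`, and `CompletenessCheck` are hypothetical names, and the per-file check is passed in as a predicate rather than taken from the library:

```java
import java.nio.file.Path;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical callback; the name and method are assumptions, not library API. */
interface ProgressListener {
    void fileChecked(Path file, boolean success, int done, int total);
}

/** Hypothetical driver that reports after every checked file. */
final class CompletenessCheck {
    static boolean run(List<Path> files, Predicate<Path> check, ProgressListener listener) {
        boolean allOk = true;
        int done = 0;
        for (Path file : files) {
            boolean ok = check.test(file);
            allOk &= ok;
            // done/total is enough for a GUI to compute a percentage
            listener.fileChecked(file, ok, ++done, files.size());
        }
        return allOk;
    }
}
```

A GUI would typically forward `done` and `total` to its progress bar on the UI thread. Estimating time remaining would need timestamps on top of this.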

Include classpath exception in bagging AGPL license

We would like to use the bagging utility as a library dependency in some java software we are writing for digital preservation. We are an academic institution; we plan on releasing our developed software under the terms of a different license.

Please consider adding a GPL linking exception to your license to enable others to embed your package as a library dependency, then distribute their software under their own license terms. This exception is very common in licenses covering libraries.

See https://en.wikipedia.org/wiki/GPL_linking_exception, in particular, the GNU Classpath exception.
Here's some example text that could be included in the license: https://enterprise.dejacode.com/licenses/public/linking-exception-agpl-3.0/#license-text

Path encoding bug

I recently discovered that the BagIt 1.0 specification requires that CR, LF, and % in file paths within manifest files are percent-encoded, and that there isn't a single BagIt implementation that does this correctly. Implementations either only encode CR and LF but not % or they encode nothing.

This implementation only encodes CR and LF but not %. This is problematic because it would fail to validate BagIt 1.0 bags that include file paths containing % characters. Likewise, it would create bags that would fail BagIt 1.0 validation in the case that there are paths that naturally contain percent-encoded characters.

For example, let's say a bag contains the file data/file%0A1.txt. This file should be written to the manifest per the spec as data/file%250A1.txt. However, this implementation writes it as data/file%0A1.txt. This means, that when this implementation validates a properly constructed 1.0 bag it will look for the file data/file%250A1.txt which does not exist. Similarly, if another implementation that follows the spec attempts to validate a bag produced by this implementation, it would look for data/file\n1.txt, which does not exist.
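
The encoding described above can be sketched as follows. `ManifestPathCodec` is a hypothetical name, and the sketch assumes upper-case hex escapes and well-formed input only:

```java
/** Hypothetical codec for manifest paths; not the library's implementation. */
final class ManifestPathCodec {
    /** Encodes '%', CR and LF. '%' must go first so new escapes are not re-encoded. */
    static String encode(String path) {
        return path.replace("%", "%25")
                   .replace("\r", "%0D")
                   .replace("\n", "%0A");
    }

    /** Reverses encode. "%25" must go last so "%250A" becomes "%0A", not LF. */
    static String decode(String encoded) {
        return encoded.replace("%0D", "\r")
                      .replace("%0A", "\n")
                      .replace("%25", "%");
    }
}
```

Here `encode("data/file%0A1.txt")` gives `data/file%250A1.txt`, and decoding that value restores the original name.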

It would seem desirable to me to move the ecosystem in the direction of properly implementing the 1.0 specification, while at the same time acknowledging that there are a large number of 1.0 bags in existence that may then become invalid.

As such, it may be prudent to, when validating bags, fall back on a series of tests. You may want to first attempt to validate per the spec, and then, if a file cannot be found, attempt to locate it by either only decoding the CR and LF or leaving the path unchanged, ideally validating all of the files using the same method.
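
A sketch of that fallback, assuming hypothetical names throughout (`FallbackDecoding`, `choose`) and a caller-supplied existence check. It tries each decoding strategy in order and accepts the first one under which every manifest path resolves, so all files are validated the same way:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical fallback chooser; not part of the library. */
final class FallbackDecoding {
    /** Decoding per the 1.0 spec: CR, LF and '%'. */
    static final UnaryOperator<String> SPEC =
        s -> s.replace("%0D", "\r").replace("%0A", "\n").replace("%25", "%");
    /** Legacy behavior: only CR and LF are decoded. */
    static final UnaryOperator<String> CR_LF_ONLY =
        s -> s.replace("%0D", "\r").replace("%0A", "\n");
    /** Legacy behavior: the path is used unchanged. */
    static final UnaryOperator<String> RAW = s -> s;

    /** Returns the first strategy under which every manifest path exists. */
    static Optional<UnaryOperator<String>> choose(List<String> manifestPaths,
                                                  Predicate<String> exists) {
        for (UnaryOperator<String> decoder : List.of(SPEC, CR_LF_ONLY, RAW)) {
            if (manifestPaths.stream().map(decoder).allMatch(exists)) {
                return Optional.of(decoder);
            }
        }
        return Optional.empty();
    }
}
```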

I have not examined fetch.txt implementations, but the same encoding requirements exist for paths in that file as well. This is potentially a thornier problem to address in a backward compatible way as it is unclear if the path data/file%250A1.txt is supposed to create data/file%250A1.txt (incorrect) or data/file%0A1.txt (correct).

Finally, I created a related ticket against the spec discussing this encoding problem, in particular how it breaks checksum utility compatibility.

Add documentation for each test in integration suite

Some of the integration test names do not give enough insight into what they are testing. More documentation (javadoc?) should be added so that it is clear to developers what each test is trying to prove or disprove.

For example: testInvalidBags() - com.github.jscancella.BagitSuiteComplanceTest should have an explanation that the compliance test suite lists both valid and invalid examples for all versions of bagit. This test ensures that invalid bags are computed to be invalid and cause an exception to be thrown.

Switch to github actions

Instead of using multiple CI/CD providers, look into using GitHub Actions.
Need to build on Windows (7, 8, and 10 preferably), Linux (Ubuntu latest preferably), and macOS.
Need to test various versions of Java (8, 9, 10, 11, 12, 13, 14, etc.)
Only 1 of these tests need to upload coverage results to our coverage provider (coveralls.io).

Cannot add Bag created files to Tagmanifest file in bagging version 4.0

When submitting an issue please include:

  • a small code example showing the incorrect behavior
    BagBuilder builder = new BagBuilder();
    builder.addAlgorithm("md5")
    .addMetadata("Bagging-Date", getCurrentISODate("yyyy-MM-dd"))
    .addPayloadFile(ingestFolder)
    .addTagFile(ingestFolder + "/bagit.txt")
    .addTagFile(ingestFolder + "/bag-info.txt")
    .addTagFile(ingestFolder + "/manifest-md5.txt")
    .bagLocation(exportFolder)
    .write();

  • the expected behavior
    in tagmanifest-md5.txt
    eaa2c609ff6371712f623f5531945b44 bagit.txt
    063d7a5ee78b06a31a871dfd336ef6d1 bag-info.txt
    c30e579035a4f2b0fc38af0c43d31843 manifest-md5.txt

  • the actual behavior
    This code throws a file-not-found exception.

  • the operating system being used, and its version
    debian 10

  • version of Bagging being used
    4.0


Question from a newbie (all around!)

Hi there,

I am sort of tech savvy (enough for creating disasters!) but not a developer by any means. I cloned your repo and tried the library (I think!) It might be good for beginners like me to include more information on how to actually use this. Is the basic way to use this as follows?

  1. Open the command line and navigate to where the code lives, e.g., C:\Users\User\Documents\GitHub\bagging
  2. Write a command like:
    Path outputDir = Paths.get("C:\Users\User\Desktop\b_latour_iconoclash_images_bag"); bag.write(C:\Users\User\Desktop\b_latour_iconoclash_images)

I did this and it worked the first time. But when I tried it with different paths it didn't work... not sure what I'm doing wrong.

Wrong structure by saving payload files

When submitting an issue please include:

obj71291926 structure:
obj71291926/test.tif
obj71291926/jpeg/test.jpeg

  • a small code example showing the incorrect behavior
    BagBuilder builder = new BagBuilder();
    builder.addAlgorithm("md5")
    .addMetadata("foo", "bar")
    .addPayloadFile(Paths.get(absolute_dir + "/import/obj71291926"))
    .addTagFile(Paths.get(absolute_dir + "/import/meta"))
    .bagLocation(Paths.get(absolute_dir + "/export/bag"))
    .write();

  • the expected behavior
    Structure of Bag MUST be
    data // data/test.tif and data/jpeg/test.jpeg
    meta
    bag-info.txt
    bagit.txt
    manifest-md5.txt
    tagmanifest-md5.txt

  • the actual behavior
    Structure of Bag INCORRECT
    obj71291926/data // obj71291926/data/test.tif and obj71291926/data/jpeg/test.jpeg
    meta
    bag-info.txt
    bagit.txt
    manifest-md5.txt
    tagmanifest-md5.txt

// Hint: the tree under obj71291926 must be copied into data, but not the top-level directory (obj71291926) itself.

  • the operating system being used, and its version
    debian 10
  • version of Bagging being used
    4.3

Change to use concurrent versions of collections

We should probably change to use concurrent versions of collections (hashmap, list, etc.) to prevent any issues with concurrency in the future. We need to make sure that those changes don't cause a big performance hit.

Hashing implementation is not thread-safe

Bagging uses an enum to represent hashers that are used to compute the digests of files that are added to bags. The problem is that since it's an enum the same hasher instance is used globally, which means that it will not compute the correct digest if multiple bags are created concurrently.

The following code demonstrates the problem:

    @Test
    public void concurrentTest() throws IOException {
        var bags = List.of(
                Files.createDirectories(Paths.get("/var/tmp/bag-1")),
                Files.createDirectories(Paths.get("/var/tmp/bag-2"))
        );
        var file = Files.writeString(Paths.get("/var/tmp/test.txt"), "a".repeat(1_000));

        var executor = Executors.newFixedThreadPool(bags.size());
        var phaser = new Phaser(bags.size() + 1);

        bags.forEach(bagDir -> {
            executor.execute(() -> {
                phaser.arriveAndAwaitAdvance();
                try {
                    new BagBuilder()
                            .addAlgorithm("sha256")
                            .bagLocation(bagDir)
                            .addPayloadFile(file)
                            .write();
                    phaser.arrive();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        });

        phaser.arriveAndAwaitAdvance();
        phaser.arriveAndAwaitAdvance();

        bags.forEach(bagDir -> {
            try {
                Bag.read(bagDir).justValidate();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }
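
One way to avoid the shared state, sketched under assumptions: `PerCallHasher` is a hypothetical class, not the library's hasher, and it uses the JCA algorithm name `SHA-256` rather than the library's `sha256`. It creates a fresh `MessageDigest` for every call so that no digest state is shared between threads:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical hasher; each call owns its own MessageDigest instance. */
final class PerCallHasher {
    static String hash(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        // MessageDigest is stateful and not thread-safe, so never share it
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Because nothing is kept between calls, two threads hashing different files cannot interleave their updates.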

Use of mutation testing in bagging - Help needed

Hello there!

My name is Ana. I noted that you use the mutation testing tool Pit in the project.
I am a postdoctoral researcher at the University of Seville (Spain), and my colleagues and I are studying how mutation testing tools are used in practice. With this aim in mind, we have analysed over 3,500 public GitHub repositories using mutation testing tools, including yours! This work has recently been published in a journal paper available at https://link.springer.com/content/pdf/10.1007/s10664-022-10177-8.pdf.

To complete this study, we are asking for your help to understand better how mutation testing is used in practice, please! We would be extremely grateful if you could contribute to this study by answering a brief survey of 21 simple questions (no more than 6 minutes). This is the link to the questionnaire https://forms.gle/FvXNrimWAsJYC1zB9.

We apologize if you have already received this message multiple times or if you have already had the opportunity to complete the survey. If you have already shared your feedback, we want to convey our appreciation, kindly disregard this message, and please accept our apologies for any inconvenience.

Drop me an e-mail if you have any questions or comments ([email protected]). Thank you very much in advance!!

Accented character in filename prevents payload verification on macOS

Steps to reproduce

  1. create a bag containing a filename with an accented character, e.g. contrôle.txt.
  2. read the bag with Bag.read().
  3. call bag.isValid() (ignoreHiddenFiles can be true or false, doesn't matter).

Expected behavior

  • isValid() returns true

Actual behavior

System information

  • bagging version: 4.4
  • OS: macOS 12.6.1 Monterey
  • Java version: OpenJDK Runtime Environment Zulu17.30+15-CA (build 17.0.1+12-LTS)

Notes

It looks like this may be a long-standing Java / macOS issue to do with how HFS+ does Unicode normalization.

If toString() conversion seems too risky, an alternative would be to compare the paths with Files.isSameFile(Path, Path), which does seem to admit they're the same. I'll see if I can create a PR.
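
If the cause is indeed Unicode normalization, which is an assumption here, a string-level alternative would be to normalize both names to the same form before comparing them. A minimal sketch with a hypothetical helper, `NormalizedNames`:

```java
import java.text.Normalizer;

/** Hypothetical helper; compares names after normalizing both to NFC. */
final class NormalizedNames {
    static boolean sameName(String a, String b) {
        String left = Normalizer.normalize(a, Normalizer.Form.NFC);
        String right = Normalizer.normalize(b, Normalizer.Form.NFC);
        return left.equals(right);
    }
}
```

For example, `contr\u00f4le.txt` (precomposed ô) and `contro\u0302le.txt` (o followed by a combining circumflex) differ under `String.equals` but compare equal here.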

gradle dependencies api vs implementation

Currently we list all dependencies as implementation, but after reading https://docs.gradle.org/current/userguide/java_library_plugin.html#sec:java_library_recognizing_dependencies it looks like some of our dependencies should be listed as api.

For example:
https://github.com/jscancella/bagging/blob/master/src/main/java/com/github/jscancella/conformance/profile/BagitProfileDeserializer.java lists classes from both jackson-databind and jackson-core in the public signatures, thus it should be listed as api and not implementation

upgrade to java 14

Java 1.8 is EOL. Move to OpenJDK 14 and continually upgrade to the latest version.

  • update gradle to output source as latest java and java 8
  • update CI configs to also run on latest java
  • modularize based on the various packages?

NPE is thrown when validating a bag against a BagIt profile

bagit-java version: 5.2.0
Operating System CentOS (Linux) 7

A null pointer exception is thrown when I attempt to validate a bag against a BagIt 1.3.0 profile without a Manifests-Required block. Looking at the class BagitProfileDeserializer (https://github.com/jscancella/bagging/blob/master/src/main/java/com/github/jscancella/conformance/profile/BagitProfileDeserializer.java#L183), it looks like all the parse* methods will throw NPEs in their for loops if the parsed block does not exist in the profile.

As I read the latest BagIt profile spec 1.3.0 (https://bagit-profiles.github.io/bagit-profiles-specification/), none of the blocks that throw NPEs if missing are required to be in a profile.
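
The general shape of a fix could be to treat an absent optional block as empty instead of iterating over null. The sketch below uses a plain `Map` as a stand-in for the parsed JSON so it stays self-contained. `OptionalBlocks` and `parseStringList` are hypothetical names, and the real deserializer works on Jackson nodes rather than maps:

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; a Map stands in for the parsed profile JSON. */
final class OptionalBlocks {
    /** Returns the block's entries, or an empty list when the block is absent. */
    static List<String> parseStringList(Map<String, List<String>> profile, String blockName) {
        List<String> block = profile.get(blockName);
        if (block == null) {
            return List.of(); // optional block missing: nothing to iterate
        }
        return List.copyOf(block);
    }
}
```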

Given

  • I have a BagIt profile without a "Manifests-Required" block
  • And I have a Bag that references this profile

When

  • I validate the Bag against the profile

Then

  • The Bag should validate.

Log output:

$ java -jar target/bagmanager.jar verify --with-profile /var/tmp/testbag1
Verifying valid bag from contents at '/var/tmp/testbag1'
Verifying conformance to BagIt profile
Bag is not valid
java.lang.NullPointerException: null
        at gov.loc.repository.bagit.conformance.profile.BagitProfileDeserializer.parseManifestTypesRequired(BagitProfileDeserializer.java:128)
        at gov.loc.repository.bagit.conformance.profile.BagitProfileDeserializer.deserialize(BagitProfileDeserializer.java:47)
        at gov.loc.repository.bagit.conformance.profile.BagitProfileDeserializer.deserialize(BagitProfileDeserializer.java:24)
        at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4011)
        at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3058)
        at gov.loc.repository.bagit.conformance.BagProfileChecker.parseBagitProfile(BagProfileChecker.java:95)
        at gov.loc.repository.bagit.conformance.BagProfileChecker.bagConformsToProfile(BagProfileChecker.java:73)
[...]

bagit-profile-sample-v1_0_json.txt
