jscancella / bagging
A clean and easy to use implementation of the BagIt specification
License: Other
@jscancella Thanks for maintaining this library!
I was just working on using it to create some bags. The problem I encountered is that it is designed to create a bag by making copies of all of the payload files. For my use case, I do not want this duplication.
I browsed the code, and it looks like it would be fairly easy to support either creating bags in place or optionally allowing moves instead of copies.
For creating bags in place, I believe the only modification that would be needed is a check to see if the entry and newEntry physical locations are the same (here) and do nothing if they are. However, I see that you have a warning here that would fire if you did try to create the bag in place. The warning text suggests that it won't create the bag, but it doesn't actually appear to short circuit.
If you don't like the idea of creating a bag in place, then perhaps moves could be supported? You could add an option, either at the builder level or at the payload file level, to indicate whether files should be moved or copied. A slightly larger change, but perhaps more in line with how you intend the builder to be used?
If you are amenable to either of these ideas, I'd be happy to send you a PR, if you'd like.
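To make the two proposals concrete, here is a minimal sketch using only java.nio.file. PayloadPlacer, Mode, and place are hypothetical names made up for illustration; they are not part of the bagging API, and the real builder may organise this differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not part of bagging): puts one payload file in place. */
final class PayloadPlacer {
  /** Hypothetical option: copy the payload (current behavior) or move it. */
  enum Mode { COPY, MOVE }

  /**
   * Returns false when source and target are already the same file
   * (the "bag in place" case), true when the file was copied or moved.
   */
  static boolean place(Path source, Path target, Mode mode) {
    try {
      // isSameFile fails if the target does not exist yet, so check that first.
      if (Files.exists(target) && Files.isSameFile(source, target)) {
        return false; // already where it belongs, nothing to do
      }
      Path parent = target.toAbsolutePath().getParent();
      if (parent != null) {
        Files.createDirectories(parent);
      }
      if (mode == Mode.MOVE) {
        Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
      } else {
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
      }
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With something like this, an in-place bag costs no extra disk space, and MOVE avoids duplication when the bag lives elsewhere on the same file system.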
Example:
var src = Files.createDirectories(Paths.get("/var/tmp/foo"));
var dst = Paths.get("/var/tmp/foo-bag");
Files.writeString(src.resolve("file.txt"), "bar");
new BagBuilder()
    .addAlgorithm("md5")
    .addPayloadFile(src)
    .bagLocation(dst)
    .write();
try (var walk = Files.walk(dst)) {
  walk.forEach(System.out::println);
}
Expected output:
/var/tmp/foo-bag
/var/tmp/foo-bag/data/foo
/var/tmp/foo-bag/data/foo/file.txt
/var/tmp/foo-bag/manifest-md5.txt
/var/tmp/foo-bag/bagit.txt
/var/tmp/foo-bag/tagmanifest-md5.txt
Actual output:
/var/tmp/foo-bag
/var/tmp/foo-bag/foo
/var/tmp/foo-bag/foo/file.txt
/var/tmp/foo-bag/manifest-md5.txt
/var/tmp/foo-bag/bagit.txt
/var/tmp/foo-bag/tagmanifest-md5.txt
The issue appears to be that ManifestBuilderVistor does not use the relative path when compiling ManifestEntries.
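As a rough illustration of the expected layout, the sketch below computes where a payload file would land when its path is taken relative to the parent of the directory that was added. PayloadPaths and targetFor are hypothetical names; this is not the library's visitor code, only one way to get the paths shown under "Expected output".

```java
import java.nio.file.Path;

/** Hypothetical helper (not part of bagging): maps a source file to its place under data/. */
final class PayloadPaths {
  /**
   * bagRoot   - root directory of the bag being written
   * addedPath - the file or directory passed to the builder
   * file      - a file found while walking addedPath
   */
  static Path targetFor(Path bagRoot, Path addedPath, Path file) {
    Path base = addedPath.getParent();
    // Relative to the parent of the added path, so the added directory's own
    // name ("foo") is kept; a path with no parent is used as given.
    Path relative = (base == null) ? file : base.relativize(file);
    return bagRoot.resolve("data").resolve(relative);
  }
}
```

For the example above, /var/tmp/foo/file.txt added via /var/tmp/foo would map to /var/tmp/foo-bag/data/foo/file.txt.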
bagit-profiles/bagit-profiles-specification#35 (comment)
because of these changes we probably need to update the bagging profile stuff to conform
Because JFrog Bintray is going away, we need to migrate to Maven Central.
Gradle 6 has just been released. See https://docs.gradle.org/6.0/release-notes.html for changes
Look into using something like https://github.com/renovatebot/renovate to submit pull requests for dependency upgrades.
Currently the domain objects allow some changes after creation. Create a pull request for review that makes the domain objects (like Bag) completely immutable, so that any attempt to modify them throws an error.
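One common way to get this behavior in plain Java is defensive copying plus unmodifiable views. The sketch below assumes a simplified stand-in class; ImmutableBagSketch and its fields are illustrative only and do not mirror the real Bag class.

```java
import java.nio.file.Path;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for a domain object such as Bag. */
final class ImmutableBagSketch {
  private final Path rootDir;
  private final Map<String, String> checksums;

  ImmutableBagSketch(Path rootDir, Map<String, String> checksums) {
    this.rootDir = rootDir;
    // Copy first so later changes to the caller's map cannot leak in,
    // then wrap so callers cannot modify our copy.
    this.checksums = Collections.unmodifiableMap(new LinkedHashMap<>(checksums));
  }

  Path getRootDir() {
    return rootDir;
  }

  /** Any put/remove on the returned map throws UnsupportedOperationException. */
  Map<String, String> getChecksums() {
    return checksums;
  }
}
```

The "modification error" here is the standard UnsupportedOperationException thrown by the unmodifiable wrapper; a custom exception type would need extra code.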
This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.
These updates have all been created already. Click a checkbox below to force a retry/rebase of any.
.circleci/config.yml
cimg/openjdk 17.0.11
cimg/openjdk 21.0
code-quality.gradle
pmd 6.55.0
eclipse.gradle
maven-central.gradle
message-bundle.gradle
settings.gradle
build.gradle
com.github.nbaztec.coveralls-jacoco 1.2.20
de.aaschmid.cpd 3.4
org.ajoberstar.grgit 5.2.2
com.github.spotbugs 6.0.20
info.solidsoft.pitest 1.15.0
com.dorongold.task-tree 2.1.1
org.slf4j:slf4j-api 2.0.16
com.fasterxml.jackson.core:jackson-core 2.17.2
com.fasterxml.jackson.core:jackson-databind 2.17.2
org.junit.jupiter:junit-jupiter 5.11.0
org.springframework.boot:spring-boot-starter-logging 3.3.2
org.bouncycastle:bcprov-jdk15on 1.70
org.kamranzafar:jtar 2.3
gradle/wrapper/gradle-wrapper.properties
gradle 8.10
look at adopting https://github.com/allegro/axion-release-plugin
a small code example showing the incorrect behavior
BagBuilder builder = new BagBuilder();
builder.addAlgorithm("md5")
    .addMetadata("foo", "bar")
    .addPayloadFile(payloadFiles)
    .addTagFile(Paths.get(absolute_dir + "/import/meta"))
    .bagLocation(Paths.get(absolute_dir + "/export/bag"))
    .write();
the expected behavior
Structure of Bag MUST be
data
meta (with files for example: mods.xml, rights.xml)
bag-info.txt
bagit.txt
manifest-md5.txt
tagmanifest-md5.txt (Must include: xxxxxxxxxxxxxxxxxxxxxxxx meta/mods.xml)
(Must include: xxxxxxxxxxxxxxxxxxxxxxxx meta/rights.xml)
the actual behavior
Structure of Bag INCORRECT
data
mods.xml
rights.xml
bag-info.txt
bagit.txt
manifest-md5.txt
tagmanifest-md5.txt
the operating system being used, and its version
debian 10
version of Bagging being used
4.2
Several of the badges listed in the README.md file are missing or displaying incorrect information (for example, the javadoc badge does not actually display any javadocs). Investigate why and submit a pull request to fix them.
https://github.com/jscancella/bagging/blob/master/message-bundle.gradle#L9 should probably depend on the jar task output instead of defining a directory that isn't used.
The current PMD ruleset has many exceptions. We should look over them and see which ones should really be excepted and document why (comments in the xml?).
For example, we probably don't want an exception for MethodArgumentCouldBeFinal, but we do want an exception for DataflowAnomalyAnalysis since it has a bug in it that doesn't recognize variables created in for-each loops.
Several people have requested that the ability to add or remove files to/from a bag easily be added. Currently the programmer has to access the Bag's internal map of checksums and add the file to them, calculating the checksum, and then write it to disk.
Investigate adding the Builder pattern (i.e. Bag.add(FILE1).add(FILE2).build()) to create a new Bag object that can then be written to disk.
There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.
Location: renovate.json
Error type: The renovate configuration file contains some invalid settings
Message: Invalid configuration option: packageRules[0].matchDatasource
After cloning the repository into a fresh local repo, I attempt to build the project while on the master branch:
./gradlew clean check
The following Javadoc generation errors occur:
> Task :javadoc
C:\Projects\bagging\src\main\java\com\github\jscancella\conformance\internal\LargeBagChecker.java:35: error: invalid end tag: </br>
* Check if a bag is "large", which is: </br>
^
C:\Projects\bagging\src\main\java\com\github\jscancella\verify\internal\BagitTextFileVerifier.java:23: warning: no description for @param
* @param bag
^
C:\Projects\bagging\src\main\java\com\github\jscancella\verify\internal\BagitTextFileVerifier.java:24: warning: no description for @throws
* @throws IOException
^
1 error
2 warnings
> Task :javadoc FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':javadoc'.
> Javadoc generation failed. Generated Javadoc options file (useful for troubleshooting): 'C:\Projects\bagging\build\tmp\javadoc\javadoc.options'
OS: Windows 10 Enterprise (64 bit)
Java:
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode)
Gradle:
------------------------------------------------------------
Gradle 6.0.1
------------------------------------------------------------
Build time: 2019-11-18 20:25:01 UTC
Revision: fad121066a68c4701acd362daf4287a7c309a0f5
Kotlin: 1.3.50
Groovy: 2.5.8
Ant: Apache Ant(TM) version 1.10.7 compiled on September 1 2019
JVM: 1.8.0_212 ( 25.212-b03)
OS: Windows 10 10.0 amd64
#Current behavior (if applicable)
When calling bag.isComplete(ignoreHiddenFiles) the system gives no feedback on where it is in the list of files to check.
#Proposed behavior
When calling bag.isComplete(ignoreHiddenFiles) the system updates an object when it has completed checking a file with either success or failure.
#Why this feature is useful
This allows GUI tools to display a progress bar for the user. It would also be nice to be able to calculate and guess the time remaining but that might not be possible
#A small code example if possible
boolean ignoreHiddenFiles = true;
bag.isComplete(ignoreHiddenFiles, ProgressTrackerObject);
What kind of interface the ProgressTrackerObject should have would depend on what GUI applications use (look into JavaFX and Swing).
See https://docs.oracle.com/javase/tutorial/uiswing/components/progress.html, https://docs.oracle.com/javase/8/docs/api/javax/swing/JProgressBar.html and https://docs.oracle.com/javase/8/javafx/api/javafx/scene/control/ProgressBar.html
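A toolkit-neutral callback is one possible shape for such a tracker. In the sketch below, ProgressListener, ProgressSketch, and checkAll are hypothetical names, and the Predicate stands in for whatever per-file check isComplete performs; none of this exists in bagging today.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical callback a GUI could implement to drive a progress bar. */
interface ProgressListener {
  void onFileChecked(String path, boolean success, int done, int total);
}

/** Hypothetical driver showing where the callback would be invoked. */
final class ProgressSketch {
  /** Checks every path, reporting after each one; returns true only if all passed. */
  static boolean checkAll(List<String> paths, Predicate<String> check, ProgressListener listener) {
    boolean allOk = true;
    int done = 0;
    for (String path : paths) {
      boolean ok = check.test(path);
      allOk &= ok;
      done++;
      // done/total is enough for JProgressBar.setValue or ProgressBar.setProgress.
      listener.onFileChecked(path, ok, done, paths.size());
    }
    return allOk;
  }
}
```

A Swing or JavaFX front end would still need to hop onto its UI thread inside the callback; that part is left out here.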
Since the maintainer of com.github.kt3k.coveralls says they don't have time to maintain it, we should switch to https://github.com/nbaztec/coveralls-jacoco-gradle-plugin
We would like to use the bagging utility as a library dependency in some java software we are writing for digital preservation. We are an academic institution; we plan on releasing our developed software under the terms of a different license.
Please consider adding a GPL linking exception to your license to enable others to embed your package as a library dependency, then distribute their software under their own license terms. This exception is very common in licenses covering libraries.
See https://en.wikipedia.org/wiki/GPL_linking_exception, in particular, the GNU Classpath exception.
Here's some example text that could be included in the license: https://enterprise.dejacode.com/licenses/public/linking-exception-agpl-3.0/#license-text
I recently discovered that the BagIt 1.0 specification requires that CR, LF, and % in file paths within manifest files are percent-encoded, and that there isn't a single BagIt implementation that does this correctly. Implementations either only encode CR and LF but not % or they encode nothing.
This implementation only encodes CR and LF but not %. This is problematic because it would fail to validate BagIt 1.0 bags that include file paths containing % characters. Likewise, it would create bags that would fail BagIt 1.0 validation in the case that there are paths that naturally contain percent-encoded characters.
For example, let's say a bag contains the file data/file%0A1.txt. This file should be written to the manifest per the spec as data/file%250A1.txt. However, this implementation writes it as data/file%0A1.txt. This means that when this implementation validates a properly constructed 1.0 bag it will look for the file data/file%250A1.txt which does not exist. Similarly, if another implementation that follows the spec attempts to validate a bag produced by this implementation, it would look for data/file\n1.txt, which does not exist.
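The encoding rule described above can be sketched in a few lines. ManifestPathCodec is a hypothetical name, not a class in bagging, and the sketch only recognises the upper-case escapes %0D, %0A, and %25.

```java
/** Hypothetical codec for manifest paths: encodes only CR, LF, and '%'. */
final class ManifestPathCodec {
  /** '%' is replaced first so the escapes added for CR and LF are not re-encoded. */
  static String encode(String path) {
    return path.replace("%", "%25")
               .replace("\r", "%0D")
               .replace("\n", "%0A");
  }

  /** '%25' is decoded last so a literal "%0A" in a file name survives the round trip. */
  static String decode(String manifestPath) {
    return manifestPath.replace("%0D", "\r")
                       .replace("%0A", "\n")
                       .replace("%25", "%");
  }
}
```

With this, encode("data/file%0A1.txt") gives data/file%250A1.txt, and decoding that value returns the original name rather than a path containing a line feed.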
It would seem desirable to me to move the ecosystem in the direction of properly implementing the 1.0 specification, while at the same time acknowledging that there are a large number of 1.0 bags in existence that may then become invalid.
As such, it may be prudent to, when validating bags, fall back on a series of tests. You may want to first attempt to validate per the spec, and then, if a file cannot be found, attempt to locate it by either only decoding the CR and LF or leaving the path unchanged, ideally validating all of the files using the same method.
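The fallback order suggested here could look roughly like the sketch below, which lists the file names to try for one manifest entry. ManifestPathFallback and its methods are hypothetical names, and only upper-case escapes are handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: candidate file names for one manifest entry, most correct first. */
final class ManifestPathFallback {
  /** Legacy behavior: only CR and LF escapes are decoded. */
  static String decodeCrLfOnly(String s) {
    return s.replace("%0D", "\r").replace("%0A", "\n");
  }

  /** Spec behavior: CR, LF, and finally '%' are decoded. */
  static String decodePerSpec(String s) {
    return decodeCrLfOnly(s).replace("%25", "%");
  }

  /** Ordered, de-duplicated candidates: per spec, CR/LF only, then unchanged. */
  static List<String> candidates(String manifestPath) {
    Set<String> ordered = new LinkedHashSet<>();
    ordered.add(decodePerSpec(manifestPath));
    ordered.add(decodeCrLfOnly(manifestPath));
    ordered.add(manifestPath);
    return new ArrayList<>(ordered);
  }
}
```

A validator following the "same method for all files" advice would pick one strategy index for the whole bag instead of mixing strategies per entry; the sketch leaves that bookkeeping out.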
I have not examined fetch.txt implementations, but the same encoding requirements exist for paths in that file as well. This is potentially a thornier problem to address in a backward compatible way as it is unclear if the path data/file%250A1.txt is supposed to create data/file%250A1.txt (incorrect) or data/file%0A1.txt (correct).
Finally, I created a related ticket against the spec discussing this encoding problem, in particular how it breaks checksum utility compatibility.
When viewing the javadoc documentation for ENUMs many of them don't include descriptions. For example: https://www.javadoc.io/doc/com.github.jscancella/bagging/2.0/com/github/jscancella/conformance/BagitWarning.html
We should add human descriptions that detail what each value means. For example:
/**
* When a bag stores another bag in the data directory
**/
BAG_WITHIN_A_BAG
Some of the integration test names do not give enough insight into what they are testing. More documentation (javadoc?) should be added so that it is clear to developers what each test is trying to prove or disprove.
For example: testInvalidBags() - com.github.jscancella.BagitSuiteComplanceTest should have an explanation that the compliance test suite lists both valid and invalid examples for all versions of bagit. This test ensures that invalid bags are detected as invalid and cause an exception to be thrown.
The documentation could always be improved. Please be sure to first read https://www.divio.com/blog/documentation/ and try and follow the recommendations for great documentation
Some ideas on where to look to improve:
// or /* using your favorite tool - github search doesn't allow you to search for them)
Instead of using multiple CI/CD providers, look into using GitHub Actions.
Need to build on Windows (7, 8, and 10 preferably), Linux (Ubuntu latest preferably), and macOS.
Need to test various versions of Java (8, 9, 10, 11, 12, 13, 14, etc.)
Only 1 of these tests need to upload coverage results to our coverage provider (coveralls.io).
Currently using 4.x series. We should upgrade before it becomes too painful to maintain the current version.
builder.addAlgorithm("md5")
    .addMetadata("Bagging-Date", getCurrentISODate("yyyy-MM-dd"))
    .addPayloadFile(ingestFolder)
    .addTagFile(ingestFolder + "/bagit.txt")
    .addTagFile(ingestFolder + "/bag-info.txt")
    .addTagFile(ingestFolder + "/manifest-md5.txt")
    .bagLocation(exportFolder)
    .write();
the expected behavior
in tagmanifest-md5.txt
eaa2c609ff6371712f623f5531945b44 bagit.txt
063d7a5ee78b06a31a871dfd336ef6d1 bag-info.txt
c30e579035a4f2b0fc38af0c43d31843 manifest-md5.txt
the actual behavior
This code fails with a "file not found" exception.
the operating system being used, and its version
debian 10
version of Bagging being used
4.0
If available Attach all logs, and or output, and or screenshots
Using the built-in JUnit assertEquals() leaves much to be desired when the assertion fails. Take a look at replacing it with AssertJ from https://joel-costigliola.github.io/assertj/
Hi there,
I am sort of tech savvy (enough for creating disasters!) but not a developer by any means. I cloned your repo and tried the library (I think!) It might be good for beginners like me to include more information on how to actually use this. Is the basic way to use this as follows?
I did this and it worked the first time. But when I tried it with different paths it didn't work... not sure what I'm doing wrong.
obj71291926 structure:
obj71291926/test.tif
obj71291926/jpeg/test.jpeg
a small code example showing the incorrect behavior
BagBuilder builder = new BagBuilder();
builder.addAlgorithm("md5")
    .addMetadata("foo", "bar")
    .addPayloadFile(Paths.get(absolute_dir + "/import/obj71291926"))
    .addTagFile(Paths.get(absolute_dir + "/import/meta"))
    .bagLocation(Paths.get(absolute_dir + "/export/bag"))
    .write();
the expected behavior
Structure of Bag MUST be
data // data/test.tif and data/jpeg/test.jpeg
meta
bag-info.txt
bagit.txt
manifest-md5.txt
tagmanifest-md5.txt
the actual behavior
Structure of Bag INCORRECT
obj71291926/data // obj71291926/data/test.tif and obj71291926/data/jpeg/test.jpeg
meta
bag-info.txt
bagit.txt
manifest-md5.txt
tagmanifest-md5.txt
// Hint: the contents of obj71291926 must be copied into data, but not the directory obj71291926 itself.
We should probably change to use concurrent version of collections (hashmap, list, etc.) to prevent any issues with concurrency in the future. We need to make sure with those changes that we don't take a big performance hit
Bagging uses an enum to represent hashers that are used to compute the digests of files that are added to bags. The problem is that since it's an enum the same hasher instance is used globally, which means that it will not compute the correct digest if multiple bags are created concurrently.
The following code demonstrates the problem:
@Test
public void concurrentTest() throws IOException {
  var bags = List.of(
      Files.createDirectories(Paths.get("/var/tmp/bag-1")),
      Files.createDirectories(Paths.get("/var/tmp/bag-2"))
  );
  var file = Files.writeString(Paths.get("/var/tmp/test.txt"), "a".repeat(1_000));
  var executor = Executors.newFixedThreadPool(bags.size());
  // One party per worker plus the test thread.
  var phaser = new Phaser(bags.size() + 1);
  bags.forEach(bagDir -> {
    executor.execute(() -> {
      // Start all builders at the same moment to provoke the race.
      phaser.arriveAndAwaitAdvance();
      try {
        new BagBuilder()
            .addAlgorithm("sha256")
            .bagLocation(bagDir)
            .addPayloadFile(file)
            .write();
      } catch (Exception e) {
        e.printStackTrace();
      } finally {
        // Always arrive, so the test thread is not left waiting if write() fails.
        phaser.arrive();
      }
    });
  });
  phaser.arriveAndAwaitAdvance(); // release the workers
  phaser.arriveAndAwaitAdvance(); // wait until every worker has finished
  executor.shutdown();
  bags.forEach(bagDir -> {
    try {
      Bag.read(bagDir).justValidate();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  });
}
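The underlying hazard is that java.security.MessageDigest is stateful, so one instance shared across threads can mix bytes from different files. A minimal way to avoid that is to create a fresh digest for every computation, as sketched below. PerCallHasher is a hypothetical name and is not how the library's hasher enum is written.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical hasher that shares no state between calls or threads. */
final class PerCallHasher {
  /** Returns the lower-case hex SHA-256 digest of the given bytes. */
  static String sha256Hex(byte[] data) {
    try {
      // A new instance per call: nothing is shared, so concurrent use is safe.
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(data);
      StringBuilder hex = new StringBuilder(hash.length * 2);
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required on every Java platform, so this is unexpected.
      throw new IllegalStateException(e);
    }
  }
}
```

An enum constant can still be used to name the algorithm, as long as it acts as a factory for digests rather than holding one.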
Many of the unit tests are still written in the "old style" of JUnit 4 even though they use JUnit 5 syntax. Take a look at https://98elements.com/blog/improve-your-tests-with-junit-5/ and update tests as appropriate.
Hello there!
My name is Ana. I noted that you use the mutation testing tool Pit in the project.
I am a postdoctoral researcher at the University of Seville (Spain), and my colleagues and I are studying how mutation testing tools are used in practice. With this aim in mind, we have analysed over 3,500 public GitHub repositories using mutation testing tools, including yours! This work has recently been published in a journal paper available at https://link.springer.com/content/pdf/10.1007/s10664-022-10177-8.pdf.
To complete this study, we are asking for your help to understand better how mutation testing is used in practice, please! We would be extremely grateful if you could contribute to this study by answering a brief survey of 21 simple questions (no more than 6 minutes). This is the link to the questionnaire https://forms.gle/FvXNrimWAsJYC1zB9.
We apologize if you have already received this message multiple times or if you have already had the opportunity to complete the survey. If you have already shared your feedback, we want to convey our appreciation; kindly disregard this message, and please accept our apologies for any inconvenience.
Drop me an e-mail if you have any questions or comments ([email protected]). Thank you very much in advance!!
Have someone look over and review README.md for spelling, grammar, and other mistakes.
Set up a Crowdin account and ask for translation help similar to https://crowdin.com/project/bagit-java
Take a look at adding mutation testing using http://pitest.org/
contrôle.txt.
Bag.read().
bag.isValid() (ignoreHiddenFiles can be true or false, doesn't matter).
Expected: isValid() returns true
Actual: isValid() throws FileNotInManifestException
bagging version: 4.4
It looks like this may be a long-standing Java / macOS issue to do with how HFS+ does Unicode normalization.
If toString() conversion seems too risky, an alternative would be to compare the paths with Files.isSameFile(Path, Path), which does seem to admit they're the same. I'll see if I can create a PR.
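The mismatch can be reproduced without a file system: the same visible name can be stored precomposed (NFC) or decomposed (NFD), and String.equals treats the two as different. The sketch below uses java.text.Normalizer to compare names; this is a different technique from the Files.isSameFile approach mentioned above, shown only to illustrate the cause. NormalizedNames is a hypothetical name.

```java
import java.text.Normalizer;

/** Hypothetical helper: compares file names while ignoring Unicode normalization form. */
final class NormalizedNames {
  static boolean sameName(String a, String b) {
    // Bring both names to NFC before comparing.
    String left = Normalizer.normalize(a, Normalizer.Form.NFC);
    String right = Normalizer.normalize(b, Normalizer.Form.NFC);
    return left.equals(right);
  }
}
```

For instance, "contr\u00f4le.txt" (precomposed) and "contro\u0302le.txt" (o followed by a combining circumflex) are unequal as strings but compare as the same name here.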
Currently we list all dependencies as implementation, but after reading https://docs.gradle.org/current/userguide/java_library_plugin.html#sec:java_library_recognizing_dependencies it looks like some of our dependencies should be listed as api.
For example:
https://github.com/jscancella/bagging/blob/master/src/main/java/com/github/jscancella/conformance/profile/BagitProfileDeserializer.java lists classes from both jackson-databind and jackson-core in the public signatures, thus they should be listed as api and not implementation.
Effectively, AGPL limits the use of the library to other GPL/AGPL projects, which most existing Java digipres projects aren't.
Java 1.8 is EOL. Move to OpenJDK 14 and continually upgrade to the latest version.
look into using something like https://xebia.com/blog/property-based-testing-java-junit-quickcheck-part-1-basics/ to add generative property based testing to make code even more robust.
bagit-java version: 5.2.0
Operating System CentOS (Linux) 7
A null pointer exception is thrown when I attempt to validate a bag against a BagIt 1.3.0 profile without a Manifests-Required block. Looking at the class BagitProfileDeserializer (https://github.com/jscancella/bagging/blob/master/src/main/java/com/github/jscancella/conformance/profile/BagitProfileDeserializer.java#L183), it looks like all the parse* methods will throw NPEs in their for loops if the parsed block does not exist in the profile.
As I read the latest BagIt profile spec 1.3.0 (https://bagit-profiles.github.io/bagit-profiles-specification/), none of the blocks that throw NPEs if missing are required to be in a profile.
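A null-safe way to read an optional block is to treat a missing block as an empty list before looping. The sketch below uses a plain Map as a stand-in for the parsed JSON profile; the real deserializer works on Jackson nodes, so OptionalBlocks and readOptionalList are purely illustrative names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reads an optional list-valued block from a parsed profile. */
final class OptionalBlocks {
  /** Returns the block's values, or an empty list when the block is absent. */
  static List<String> readOptionalList(Map<String, List<String>> profile, String key) {
    List<String> values = profile.get(key);
    if (values == null) {
      // Block not present in the profile: nothing is required, so nothing to loop over.
      return Collections.emptyList();
    }
    return Collections.unmodifiableList(new ArrayList<>(values));
  }
}
```

Callers can then iterate over the result unconditionally, for example over readOptionalList(profile, "Manifests-Required").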
Given
When
Then
Log output:
$ java -jar target/bagmanager.jar verify --with-profile /var/tmp/testbag1
Verifying valid bag from contents at '/var/tmp/testbag1'
Verifying conformance to BagIt profile
Bag is not valid
java.lang.NullPointerException: null
at gov.loc.repository.bagit.conformance.profile.BagitProfileDeserializer.parseManifestTypesRequired(BagitProfileDeserializer.java:128)
at gov.loc.repository.bagit.conformance.profile.BagitProfileDeserializer.deserialize(BagitProfileDeserializer.java:47)
at gov.loc.repository.bagit.conformance.profile.BagitProfileDeserializer.deserialize(BagitProfileDeserializer.java:24)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4011)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3058)
at gov.loc.repository.bagit.conformance.BagProfileChecker.parseBagitProfile(BagProfileChecker.java:95)
at gov.loc.repository.bagit.conformance.BagProfileChecker.bagConformsToProfile(BagProfileChecker.java:73)
[...]
Looking at https://coveralls.io/builds/19944497/source?filename=src/main/java/com/github/jscancella/hash/BagitChecksumNameMapping.java, the unit test coverage is missing cases for when the algorithm doesn't exist on the machine.
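The missing case can be exercised with any algorithm name the JVM does not provide, because MessageDigest.getInstance throws NoSuchAlgorithmException for it. The sketch below shows the lookup in isolation; AlgorithmLookup is a hypothetical name and does not reflect how BagitChecksumNameMapping is implemented.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Optional;

/** Hypothetical lookup that reports a missing algorithm instead of throwing. */
final class AlgorithmLookup {
  /** Returns a digest for the given Java algorithm name, or empty if unavailable. */
  static Optional<MessageDigest> find(String javaAlgorithmName) {
    try {
      return Optional.of(MessageDigest.getInstance(javaAlgorithmName));
    } catch (NoSuchAlgorithmException e) {
      // This is the branch the coverage report shows as untested.
      return Optional.empty();
    }
  }
}
```

A unit test for the uncovered branch only needs a name such as "NOT-A-REAL-ALGORITHM" and an assertion that the result is empty (or, in the library's case, that the expected exception is raised).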