
Comments (14)

vishaluk commented on June 10, 2024

Trying to get back to the original point of this discussion: is it now possible to specify the compression when storing variables in a .jld file?

from jld.jl.

eschnett commented on June 10, 2024

HDF5 has a plethora of options that could/should be set. For example, one has to choose a chunk size; there are two built-in compression types (szip and zlib); one can choose the compression level; and one can additionally use a shuffle filter to improve the compression ratio. There's a host of additional options. Thus, I would regard this boolean as a high-level interface (do you want to trade off computing time to save space?) that chooses "reasonable" defaults, leaving the details to another, more elaborate interface for experts.
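To illustrate why a shuffle filter can improve the compression ratio, here is a minimal sketch that uses only the Python standard library. It is not HDF5 or JLD code: the `shuffle` and `unshuffle` helpers and the sample data are assumptions made for illustration, and `zlib` stands in for HDF5's deflate filter.

```python
import struct
import zlib


def shuffle(raw: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every element, then byte 1 of every element, and so on."""
    return b"".join(raw[k::itemsize] for k in range(itemsize))


def unshuffle(shuffled: bytes, itemsize: int) -> bytes:
    """Invert shuffle(): interleave the byte planes back into elements."""
    n = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for k in range(itemsize):
        out[k::itemsize] = shuffled[k * n:(k + 1) * n]
    return bytes(out)


# Hypothetical sample data: 10,000 slowly varying Float64 values.
values = [1000.0 + 0.001 * i for i in range(10_000)]
raw = struct.pack(f"<{len(values)}d", *values)

# Compress the same bytes with and without the shuffle step (zlib level 6).
plain_size = len(zlib.compress(raw, 6))
shuffled_size = len(zlib.compress(shuffle(raw, 8), 6))

print(f"raw: {len(raw)} B, deflate: {plain_size} B, "
      f"shuffle+deflate: {shuffled_size} B")
```

Shuffling places the nearly constant sign/exponent bytes of neighbouring values next to each other, which gives deflate long repeated runs to work with. How much this helps depends on the data.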


avierck commented on June 10, 2024

Sure, I agree that having a boolean is way more comfortable. However, it took me a lot of time to figure out that the issues I had accessing my .jld file with h5py in Python were due to JLD using Blosc compression instead of a compression built into HDF5, such as zlib.
The only solutions for me right now (as I apparently haven't managed to compile/install the Blosc filter correctly) are to not use compression at all or to write my own saving method, neither of which is a great option.
So while one might argue about what exactly a "reasonable default" for compression settings should be (see all the options you've mentioned), I - not being an expert - am just looking for an easy option to have compression in a "compatibility" mode.


timholy commented on June 10, 2024

https://github.com/JuliaLang/HDF5.jl/blob/master/doc/hdf5.md#reading-and-writing-data


avierck commented on June 10, 2024

Ah, ok. So the idea would be to use HDF5.jl instead (with some compression-specific saving options) if I need interoperability with Python (h5py), and to use JLD.jl mainly (or only) for Julia I/O, correct?
So if I want the Julia-specific benefits of JLD.jl over HDF5.jl, I'd best not use compression at all (since I can't specify which type)?
If this is correct, then I guess my feature request could be closed, as for most users a boolean is sufficient for the "I'm using JLD.jl with compression for Julia I/O only" use case.


eschnett commented on June 10, 2024

It may be good to have a "compatibility" flag, ensuring that the resulting HDF5 files can be read with a run-of-the-mill (stock) install of HDF5.

Did someone contact the HDF group regarding the Blosc compression library, and whether the respective decompression filter could become part of the standard? Currently, the 1.10 release of HDF5 is in alpha (or beta?) state, so this would be a good time to ask for such a feature.


timholy commented on June 10, 2024

There's MAT.jl to ensure MATLAB/Julia compatibility. I've long urged someone to write an H5PY.jl so it's easy to read/write in a Python-compatible format.


eschnett commented on June 10, 2024

HDF5 is a language-independent standard, so everything written as HDF5 can be read everywhere. The only exception is if you use non-standard filters; in this case, the reader obviously needs to have the same filter installed. So h5py can read JLD files just fine -- if one installs h5py with the Blosc filter.

To be "HDF5 compatible", one thus only needs to not use the Blosc filter; no explicit package is necessary.

I've asked the HDF forum about Blosc; my question is currently awaiting moderation.


timholy commented on June 10, 2024

"HDF5 is a language-independent standard, so everything written as HDF5 can be read everywhere."

So why does JLD exist, then? And why do we need MAT.jl?

See, for example, JuliaIO/HDF5.jl#180


avierck commented on June 10, 2024

I was under the impression that one of the main advantages of HDF5 was compatibility, most importantly forward compatibility, no? So I'd presume that (most of the time) one should try to keep the actual data accessible, thus avoiding the creation of one-language-only binary files?
I do completely understand the reasoning for developing a file format for Julia on top of HDF5, so if the decision is ".jld is a dialect of HDF5, meant for Julia only (or mainly)", that would be completely understandable. I was simply surprised that, while JLD.jl apparently is a very convenient HDF5 storage package, its comfort led me into a very deep hole trying (and failing) to install the Blosc filter on three different platforms.
I guess the question is maybe not whether using Blosc is good "HDF5 practice" or not, but rather whether it is feasible to have a more "future-proof" compression option in JLD (at least as long as Blosc does not come with HDF5), no?


eschnett commented on June 10, 2024

One option that wasn't mentioned before would be to have a converter that takes a Blosc-compressed HDF5 file as input and produces a vanilla HDF5 file as output. The HDF5 utility h5repack can already do this kind of conversion (e.g. adding/removing compression); if the HDF5 Julia package contained a Blosc-enabled h5repack utility, the data would be easier to access.


timholy commented on June 10, 2024

I think it's better to specify the compression format when writing than to convert the file afterwards.

Anyway, this issue won't be resolved until somebody just submits a PR.


eschnett commented on June 10, 2024

I ran some benchmarks. (If you've run benchmarks before, then you'll know that these numbers will be highly biased, but in the absence of other data, they can help you decide where to look for better data.)

I considered two cases:

  • case A: a 20,000^2 Float64 matrix with identical elements (highly compressible)
  • case B: a 10,000^2 Float64 matrix with random elements (basically incompressible)

I compared blosc, deflate (standard HDF5), and no compression. The three values in each list below are given in that order.

File sizes:

  • case A: 18.63 MB, 5.49 MB, 3200 MB
  • case B: 800 MB, 800 MB, 800 MB

Result: Both deflate and blosc compress well. Since this is an extreme and unrealistic case, I don't want to judge the details of the compression ratio.

Time to save (best of 3):

  • case A: 2.1 s, 26.5 s, 29.3 s
  • case B: 9.0 s, 5.9 s, 5.8 s

Result: blosc is significantly faster than deflate (as advertised) if the data are compressible; for incompressible data, it seems slightly slower.

Time to load (best of 3):

  • case A: 2.4 s, 6.7 s, 29.3 s
  • case B: 0.54 s, 0.31 s, 0.29 s

Result: For compressible data, blosc is faster. For incompressible data, blosc is slower.

For my own work I draw the conclusion that Blosc has advantages over deflate, but these are not a killer argument for Blosc. Blosc seems to be better than deflate, and I'm going to lobby in the HDF forum to include it by default (or to speed up their deflate algorithm).
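For anyone who wants to try a comparison of this kind on their own machine, here is a rough sketch of the two cases using only `zlib` from the Python standard library. It does not reproduce the JLD/HDF5 measurements above: Blosc is not included, the arrays are far smaller, and the `measure` helper is a name made up for this example.

```python
import os
import struct
import time
import zlib


def measure(data: bytes, level: int = 6):
    """Return (compressed size, compress seconds, decompress seconds)."""
    t0 = time.perf_counter()
    packed = zlib.compress(data, level)
    t1 = time.perf_counter()
    restored = zlib.decompress(packed)
    t2 = time.perf_counter()
    assert restored == data  # sanity check: the round trip is lossless
    return len(packed), t1 - t0, t2 - t1


n_elements = 250_000            # assumption: much smaller than the matrices above
n_bytes = 8 * n_elements        # 8 bytes per Float64

# Case A: identical elements (highly compressible).
case_a = struct.pack("<d", 1.0) * n_elements
# Case B: random bytes (basically incompressible).
case_b = os.urandom(n_bytes)

size_a, save_a, load_a = measure(case_a)
size_b, save_b, load_b = measure(case_b)

print(f"case A: {n_bytes} B -> {size_a} B, "
      f"compress {save_a:.4f} s, decompress {load_a:.4f} s")
print(f"case B: {n_bytes} B -> {size_b} B, "
      f"compress {save_b:.4f} s, decompress {load_b:.4f} s")
```

The timings vary from run to run and from machine to machine, so taking the best of several runs (as done above) is a reasonable way to reduce noise.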


StefanKarpinski commented on June 10, 2024

@avierck, JLD is precisely a format on top of HDF5 for Julia data. If you want vanilla HDF5, the HDF5 package is what you want.

