Comments (10)

timholy commented on May 30, 2024

Works for me. Things to try:

  • If you run the test/runtests.jl file, does other stuff fail for you too? (There's a UTF8 test in there.)
  • If you look at the file using h5dump, does everything look fine? This might help tell whether it's the reading or writing that's to blame.
  • If you use plain HDF5 are things OK?
using HDF5

# Write a single UTF8 string with plain HDF5
f = h5open("/tmp/test.h5", "w")
a = utf8("Jon")
write(f, "utf8string", a)
close(f)

# Read it back
f = h5open("/tmp/test.h5")
b = read(f, "utf8string")
close(f)

from hdf5.jl.

malmaud commented on May 30, 2024

Thanks Tim; I've tried all those steps but am still not sure what's wrong.

  • runtests.jl passes, using the latest HDF5.jl
  • h5dump seems to be outputting something reasonable, but I'm not fluent enough in HDF5 to be sure so I included the output at the bottom
  • Using plain HDF5 doesn't make any difference.
    Also: Saving a single UTF8 string works fine, as in the example of plain HDF5 you gave. The problem only appears when saving an array of UTF8 strings, which breaks reading whether or not plain HDF5 is used.
HDF5 "hdf5test" {
GROUP "/" {
   DATASET "_require" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  NULL
      DATA {
      }
      ATTRIBUTE "julia type" {
         DATATYPE  H5T_STRING {
            STRSIZE 30;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "Core.Array{Core.ASCIIString,1}"
         }
      }
   }
   DATASET "x" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET unknown_cset;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): "Jon", "Tim"
      }
      ATTRIBUTE "julia type" {
         DATATYPE  H5T_STRING {
            STRSIZE 29;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "Core.Array{Core.UTF8String,1}"
         }
      }
   }
}
}
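
For readers who want to try the array case described above, here is a minimal round-trip sketch. It is written for current Julia, where `utf8` no longer exists and every `String` is UTF-8, so it is an adaptation of the plain-HDF5 snippet from the first comment rather than the exact code used in this thread. The temporary path and the dataset name "x" are illustrative choices.

```julia
using HDF5

# Write a two-element array of strings with plain HDF5 (no JLD involved).
path = joinpath(mktempdir(), "test.h5")
h5open(path, "w") do f
    write(f, "x", ["Jon", "Tim"])
end

# Read the array back; this is the step that failed in the report above.
b = h5open(path, "r") do f
    read(f, "x")
end
println(b)
```

Running h5dump on the resulting file should show which CSET the dataset "x" was written with.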

timholy commented on May 30, 2024

The saving looks like it's working (feel free to email me the resulting file if you want to be absolutely certain, but by eye it looks fine). Must be an issue on the loading. Am I correct in assuming there's no problem with an array of ASCIIStrings? If that's right, it's weird, because I don't see how the string type interacts with the array-reading.

malmaud commented on May 30, 2024

OK, I've done further tests and determined that the problem only happens when using HDF5 via an IJulia notebook; it works from the Julia CLI. And it does work fine with arrays of ASCIIStrings in either case. Maybe some subtle interaction with how IJulia handles Unicode is responsible, although it's hard to see how.

EDIT: Never mind, that's not what matters. What matters is whether I include(".julia/HDF5/test/jld.jl") and then run my own tests (that works), or whether I directly type
using HDF5
using JLD

and then test, in which case reading Unicode string arrays fails. So somehow jld.jl is setting up a context in which things work.

EDIT 2: In fact, the tests in jld.jl fail if included from a worker node. Like

addprocs(2)
@spawn include(".julia/HDF5/test/jld.jl") # Fails
include(".julia/HDF5/test/jld.jl") # works

timholy commented on May 30, 2024

This is beginning to smell like a subtle Julia bug. I don't have any good ideas at the moment, but I'll sleep on it. Thanks for sticking with this.

malmaud commented on May 30, 2024

Thanks for thinking about it.

timholy commented on May 30, 2024

Since I can't reproduce this on my own machine (I hate these platform-specific bugs), I'll have to ask you for more tests. I suspect the next step is to insert @show statements before line 1031 of plain.jl (the next-to-the-error line in your backtrace) to inspect all the variables. Alternatively, wrap that entire function in @debug from Debug.jl and step through it. Here's an example of what I get when I do that:

julia> using HDF5, JLD

julia> @load "/tmp/test.jld"
sz => ()
len => 1
objtype => HDF5 datatype 50331964
isvar => true
ilen => 8
S => ASCIIString
sz => (2,)
len => 2
objtype => HDF5 datatype 50331969
isvar => true
ilen => 8
S => UTF8String
obj => HDF5 dataset: /x (file: /tmp/test.jld)
memtype_id => 50331970
buf => [Ptr{Uint8} @0x0000000002070b80,Ptr{Uint8} @0x0000000002070b80]
2-element Array{UTF8String,1}:
 "Jon"
 "Tim"

Notice that it's the second call to this function that is reading "x". The ilen input is not used in this case.

Here's what my modification of that function looked like:

function read{S<:ByteString}(obj::DatasetOrAttribute, ::Type{Array{S}})
    local isvar::Bool
    local ret::Array{S}
    sz = size(obj)
    len = prod(sz)
    objtype = datatype(obj)
    try
        isvar = h5t_is_variable_str(objtype.id)
        ilen = int(h5t_get_size(objtype.id))
    finally
        close(objtype)
    end
    @show sz
    @show len
    @show objtype
    @show isvar
    @show ilen
    @show S
    memtype_id = h5t_copy(H5T_C_S1)
    if isempty(sz)
        ret = Array(S, 0)
    else
        ret = Array(S, sz...)
        if isvar
            # Variable-length
            buf = Array(Ptr{Uint8}, len)
            h5t_set_size(memtype_id, H5T_VARIABLE)
            @show obj
            @show memtype_id
            @show buf
            readarray(obj, memtype_id, buf)
            # FIXME? Who owns the memory for each string? Will Julia free it?
            for i = 1:len
                ret[i] = bytestring(buf[i])
            end
        else
            # Fixed length
            ilen += 1  # for null terminator
            buf = Array(Uint8, len*ilen)
            h5t_set_size(memtype_id, ilen)
            readarray(obj, memtype_id, buf)
            p = convert(Ptr{Uint8}, buf)
            for i = 1:len
                ret[i] = bytestring(p)
                p += ilen
            end
        end
    end
    h5t_close(memtype_id)
    ret
end
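
The fixed-length branch of the function above reads everything into one byte buffer and then walks it in slots of `ilen` bytes, each holding a null-terminated string. A standalone sketch of that unpacking step, in current Julia syntax and with no HDF5 calls, might look like the following. The helper name `unpack_fixed` is made up for illustration, and here `ilen` is taken to be the full slot width, terminator included.

```julia
# Unpack `len` strings from `buf`, where each string occupies a slot of
# `ilen` bytes and ends at its first NUL byte (or at the end of the slot).
function unpack_fixed(buf::AbstractVector{UInt8}, len::Int, ilen::Int)
    ret = Vector{String}(undef, len)
    for i in 1:len
        slot = view(buf, (i - 1) * ilen + 1 : i * ilen)
        n = findfirst(==(0x00), slot)
        stop = n === nothing ? ilen : n - 1
        ret[i] = String(slot[1:stop])  # copies the bytes before the terminator
    end
    return ret
end

# Two 4-byte slots: "Jon" and "Tim", each followed by a NUL terminator.
buf = Vector{UInt8}("Jon\0Tim\0")
names = unpack_fixed(buf, 2, 4)
println(names)
```

The variable-length branch is different: HDF5 fills the buffer with one pointer per string, which is why the `@show buf` output lists `Ptr` values rather than bytes.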

malmaud commented on May 30, 2024

I get the following from those @show statements. It looks OK, except that my objtype is HDF5 datatype 50331963 and my memtype_id is 50331964, while yours are 50331969 and 50331970. Maybe that's pertinent.

sz => (2,)
len => 2
objtype => HDF5 datatype 50331963
isvar => true
ilen => 8
S => UTF8String
obj => HDF5 dataset: /x (file: test.jld)
memtype_id => 50331964
buf => [Ptr{Uint8} @0x00007f9738bf0010,Ptr{Uint8} @0x00007f973b47f890]

timholy commented on May 30, 2024

I'm not certain one should take the exact numbers too seriously, but nevertheless I think we may have a culprit. Try the version below and see what you get:

function read{S<:ByteString}(obj::DatasetOrAttribute, ::Type{Array{S}})
    local isvar::Bool
    local ret::Array{S}
    sz = size(obj)
    len = prod(sz)
    objtype = datatype(obj)
    try
        isvar = h5t_is_variable_str(objtype.id)
        ilen = int(h5t_get_size(objtype.id))
    finally
        close(objtype)
    end
    memtype_id = h5t_copy(H5T_C_S1)
    # h5t_set_cset(memtype_id, cset(S))     # try me!
    if isempty(sz)
        ret = Array(S, 0)
    else
        ret = Array(S, sz...)
        if isvar
            # Variable-length
            buf = Array(Ptr{Uint8}, len)
            h5t_set_size(memtype_id, H5T_VARIABLE)
            @show hdf5_to_julia_eltype(objtype)
            @show hdf5_to_julia_eltype(HDF5Datatype(memtype_id))
            readarray(obj, memtype_id, buf)
            # FIXME? Who owns the memory for each string? Will Julia free it?
            for i = 1:len
                ret[i] = bytestring(buf[i])
            end
        else
            # Fixed length
            ilen += 1  # for null terminator
            buf = Array(Uint8, len*ilen)
            h5t_set_size(memtype_id, ilen)
            readarray(obj, memtype_id, buf)
            p = convert(Ptr{Uint8}, buf)
            for i = 1:len
                ret[i] = bytestring(p)
                p += ilen
            end
        end
    end
    h5t_close(memtype_id)
    ret
end

My prediction is that you'll see one is ASCII and one is UTF8. Then, if you uncomment that line I added with h5t_set_cset(memtype_id, cset(S)), that may fix the problem.
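
To see the suspected mismatch in isolation, one can build the memory datatype by hand and inspect its character set before and after the extra call. The sketch below assumes a recent HDF5.jl, where the low-level wrappers live in the `HDF5.API` submodule; those qualified names are an assumption about the current package and are not what the thread-era code used. It relies on the HDF5 library behavior that a copy of `H5T_C_S1` starts out tagged as ASCII.

```julia
using HDF5
const API = HDF5.API  # assumed location of the low-level wrappers in recent HDF5.jl

# Build a variable-length string memory type, as the read function does.
memtype_id = API.h5t_copy(API.H5T_C_S1)
API.h5t_set_size(memtype_id, API.H5T_VARIABLE)

# Character set of the freshly copied type.
cset_before = API.h5t_get_cset(memtype_id)

# The proposed fix: tag the memory type as UTF-8 to match the dataset on disk.
API.h5t_set_cset(memtype_id, API.H5T_CSET_UTF8)
cset_after = API.h5t_get_cset(memtype_id)

API.h5t_close(memtype_id)

println("ASCII before: ", cset_before == API.H5T_CSET_ASCII)
println("UTF-8 after:  ", cset_after == API.H5T_CSET_UTF8)
```

If the file's datatype is UTF-8 while the memory type is still ASCII, the two disagree, which is the situation the commented-out `h5t_set_cset` line is meant to correct.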

malmaud commented on May 30, 2024

That did indeed fix it. Thanks for working through this with me.
