Comments (10)
Works for me. Things to try:
- If you run the test/runtests.jl file, does other stuff fail for you too? (There's a UTF8 test in there.)
- If you look at the file using h5dump, does everything look fine? This might help tell whether it's the reading or the writing that's to blame.
- If you use plain HDF5 are things OK?
using HDF5
# Write a single UTF8 string with the plain HDF5 interface
f = h5open("/tmp/test.h5", "w")
a = utf8("Jon")
write(f, "utf8string", a)
close(f)
# Read it back
f = h5open("/tmp/test.h5")
b = read(f, "utf8string")
close(f)
from hdf5.jl.
Thanks Tim; I've tried all those steps but am still not sure what's wrong.
- runtests.jl passes, using the latest HDF5.jl
- h5dump seems to output something reasonable, but I'm not fluent enough in HDF5 to be sure, so I've included the output at the bottom
- Using plain HDF5 doesn't make any difference.
Also: Saving a single UTF8 string works fine, as in the example of plain HDF5 you gave. The problem only appears when saving an array of UTF8 strings, which breaks reading whether or not plain HDF5 is used.
HDF5 "hdf5test" {
GROUP "/" {
   DATASET "_require" {
      DATATYPE H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE NULL
      DATA {
      }
      ATTRIBUTE "julia type" {
         DATATYPE H5T_STRING {
            STRSIZE 30;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE SCALAR
         DATA {
         (0): "Core.Array{Core.ASCIIString,1}"
         }
      }
   }
   DATASET "x" {
      DATATYPE H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET unknown_cset;
         CTYPE H5T_C_S1;
      }
      DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): "Jon", "Tim"
      }
      ATTRIBUTE "julia type" {
         DATATYPE H5T_STRING {
            STRSIZE 29;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE SCALAR
         DATA {
         (0): "Core.Array{Core.UTF8String,1}"
         }
      }
   }
}
}
from hdf5.jl.
The saving looks like it's working (feel free to email me the resulting file if you want to be absolutely certain, but by eye it looks fine). Must be an issue on the loading. Am I correct in assuming there's no problem with an array of ASCIIStrings? If that's right, it's weird, because I don't see how the string type interacts with the array-reading.
from hdf5.jl.
OK, I've done further tests and determined the problem only happens when using HDF5 via an IJulia notebook - it works from the Julia CLI. And it does work fine with arrays of ASCIIStrings in either case. Maybe some subtle interaction on how IJulia handles unicode is responsible, although it's hard to see how.
EDIT: Never mind, that's not what matters. What matters is whether I include(".julia/HDF5/test/jld.jl") and then run my own tests (that works), or whether I directly type
using HDF5
using JLD
and then test, in which case reading Unicode string arrays fails. So somehow jld.jl is setting up a context in which things work.
EDIT 2: In fact, the tests in jld.jl fail if included from a worker node. For example:
addprocs(2)
@spawn include(".julia/HDF5/test/jld.jl") # fails
include(".julia/HDF5/test/jld.jl") # works
from hdf5.jl.
This is beginning to smell like a subtle Julia bug. I don't have any good ideas at the moment, but I'll sleep on it. Thanks for sticking with this.
from hdf5.jl.
Thanks for thinking about it.
from hdf5.jl.
Since I can't reproduce this on my own machine (I hate these platform-specific bugs), I'll have to ask you for more tests. I suspect the next step is to insert @show statements before line 1031 of plain.jl (the next-to-the-error line in your backtrace) to inspect all the variables. Alternatively, wrap that entire function in @debug from Debug.jl and step through it. Here's an example of what I get when I do that:
julia> using HDF5, JLD
julia> @load "/tmp/test.jld"
sz => ()
len => 1
objtype => HDF5 datatype 50331964
isvar => true
ilen => 8
S => ASCIIString
sz => (2,)
len => 2
objtype => HDF5 datatype 50331969
isvar => true
ilen => 8
S => UTF8String
obj => HDF5 dataset: /x (file: /tmp/test.jld)
memtype_id => 50331970
buf => [Ptr{Uint8} @0x0000000002070b80,Ptr{Uint8} @0x0000000002070b80]
2-element Array{UTF8String,1}:
"Jon"
"Tim"
Notice it's the second call to this function that reads "x". The ilen input is not used in this case.
Here's what my modification of that function looked like:
function read{S<:ByteString}(obj::DatasetOrAttribute, ::Type{Array{S}})
    local isvar::Bool
    local ret::Array{S}
    sz = size(obj)
    len = prod(sz)
    objtype = datatype(obj)
    try
        isvar = h5t_is_variable_str(objtype.id)
        ilen = int(h5t_get_size(objtype.id))
    finally
        close(objtype)
    end
    @show sz
    @show len
    @show objtype
    @show isvar
    @show ilen
    @show S
    memtype_id = h5t_copy(H5T_C_S1)
    if isempty(sz)
        ret = Array(S, 0)
    else
        ret = Array(S, sz...)
        if isvar
            # Variable-length
            buf = Array(Ptr{Uint8}, len)
            h5t_set_size(memtype_id, H5T_VARIABLE)
            @show obj
            @show memtype_id
            @show buf
            readarray(obj, memtype_id, buf)
            # FIXME? Who owns the memory for each string? Will Julia free it?
            for i = 1:len
                ret[i] = bytestring(buf[i])
            end
        else
            # Fixed length
            ilen += 1 # for null terminator
            buf = Array(Uint8, len*ilen)
            h5t_set_size(memtype_id, ilen)
            readarray(obj, memtype_id, buf)
            p = convert(Ptr{Uint8}, buf)
            for i = 1:len
                ret[i] = bytestring(p)
                p += ilen
            end
        end
    end
    h5t_close(memtype_id)
    ret
end
from hdf5.jl.
I get the following from those @show statements. It looks OK, except that my objtype is HDF5 datatype 50331963 and my memtype_id is 50331964, while yours are 50331969 and 50331970. Maybe that's pertinent.
sz => (2,)
len => 2
objtype => HDF5 datatype 50331963
isvar => true
ilen => 8
S => UTF8String
obj => HDF5 dataset: /x (file: test.jld)
memtype_id => 50331964
buf => [Ptr{Uint8} @0x00007f9738bf0010,Ptr{Uint8} @0x00007f973b47f890]
from hdf5.jl.
I'm not certain one should take the exact numbers too seriously, but nevertheless I think we may have a culprit. Try the version below and see what you get:
function read{S<:ByteString}(obj::DatasetOrAttribute, ::Type{Array{S}})
    local isvar::Bool
    local ret::Array{S}
    sz = size(obj)
    len = prod(sz)
    objtype = datatype(obj)
    try
        isvar = h5t_is_variable_str(objtype.id)
        ilen = int(h5t_get_size(objtype.id))
    finally
        close(objtype)
    end
    memtype_id = h5t_copy(H5T_C_S1)
    # h5t_set_cset(memtype_id, cset(S)) # try me!
    if isempty(sz)
        ret = Array(S, 0)
    else
        ret = Array(S, sz...)
        if isvar
            # Variable-length
            buf = Array(Ptr{Uint8}, len)
            h5t_set_size(memtype_id, H5T_VARIABLE)
            @show hdf5_to_julia_eltype(objtype)
            @show hdf5_to_julia_eltype(HDF5Datatype(memtype_id))
            readarray(obj, memtype_id, buf)
            # FIXME? Who owns the memory for each string? Will Julia free it?
            for i = 1:len
                ret[i] = bytestring(buf[i])
            end
        else
            # Fixed length
            ilen += 1 # for null terminator
            buf = Array(Uint8, len*ilen)
            h5t_set_size(memtype_id, ilen)
            readarray(obj, memtype_id, buf)
            p = convert(Ptr{Uint8}, buf)
            for i = 1:len
                ret[i] = bytestring(p)
                p += ilen
            end
        end
    end
    h5t_close(memtype_id)
    ret
end
My prediction is that you'll see one is ASCII and one is UTF8. Then, if you uncomment that line I added with h5t_set_cset(memtype_id, cset(S)), that may fix the problem.
from hdf5.jl.
That did indeed fix it. Thanks for working through this with me.
from hdf5.jl.