
zarr.jl's Introduction

Zarr.jl

Zarr.jl is a Julia package providing an implementation of chunked, compressed, N-dimensional arrays. Zarr was originally developed as a Python package; Zarr.jl aims to implement the Zarr spec.

Package status

The package currently implements basic functionality for reading and writing Zarr arrays. It is still under active development, and many compressors and backends supported by the Python implementation are missing.

Quick start

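using Zarr

# Create a 10000×10000 Int array on disk, stored in 1000×1000 chunks
z1 = zcreate(Int, 10000, 10000, path = "data/example.zarr", chunks = (1000, 1000))
z1[:] .= 42          # fill the whole array
z1[:, 1] = 1:10000   # overwrite the first column
z1[1, :] = 1:10000   # overwrite the first row

# Reopen the array from disk and read a small block
z2 = zopen("data/example.zarr")
z2[1:10, 1:10]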

zarr.jl's People

Contributors

alex-s-gardner, alexander-barth, dependabot[bot], fbehr-data, felixcremer, gdkrmr, github-actions[bot], ilia-kats, juliatagbot, meggart, nhz2, oxinabox, ranocha, sjkelly, visr, yarikoptic, ziw-liu


zarr.jl's Issues

a bug fix for the general ragged array

Thanks a lot for developing this package! I want to use Zarr's ragged arrays for some tables that contain multi-dimensional arrays as elements, and I run into problems when I try to form a ragged array of arrays with two or more dimensions. The issue appears when you run code like the following.

using Zarr

arr = zcreate(Array{Float64, 2}, 10)
arr[:] .= [zeros(Float64,2,2) for i in 1:10]

This will give an error because of the lack of the _zero method for general multidimensional arrays around line 133 of ZArray.jl. I guess adding a line something like

_zero(::Type{<:Array{T, N}}) where T where N = zeros(T, zeros(Int, N)...)

will fix the issue, as I think we can follow what has already been defined for Vector{T}. I would really appreciate it if you could fix this problem, as it is probably easily fixable. Thanks!
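The proposed method can be tried in isolation. The following is a minimal sketch, assuming that the missing piece is simply "an empty array of the right element type and dimensionality"; the name `empty_element` is hypothetical and only stands in for Zarr's internal `_zero`.

```julia
# Hypothetical stand-in for Zarr's internal `_zero`; the name is ours.
# Returns an empty N-dimensional array (size 0 along every axis) for Array{T,N}.
empty_element(::Type{Array{T,N}}) where {T,N} =
    Array{T,N}(undef, ntuple(_ -> 0, Val(N)))

a = empty_element(Array{Float64,2})
@show size(a)     # (0, 0)
@show eltype(a)   # Float64

# The one-dimensional case falls out of the same definition.
v = empty_element(Vector{Int})
@show size(v)     # (0,)
```

This produces the same value as `zeros(T, zeros(Int, N)...)` from the suggestion above, without allocating the intermediate vector of sizes.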

logo

We are missing a proper logo for this package. Any ideas or suggestions would be highly appreciated.

refactor to use transcodingstreams.jl API ?

i see in src/compressors.jl that support for zlib exists. have you considered refactoring zarr.jl to access this compressor via the transcodingstream API instead? then you could get lz4, zstd, etc. too for free.

i ask, because blosc1.jl is not thread safe (in my hands anyway), and i'd like to read from multiple zarr chunks simultaneously. codecZstd.jl is thread safe i believe, and i'd like to switch my zarr data set to use it. i'm not sure about codecZlib.jl.

note that currently transcodingstreams.jl does not support blosc, however there is an issue suggesting it do so, with code in a link: JuliaIO/Blosc.jl#79. so zarr.jl wouldn't have to drop support for blosc1 were that issue followed through.

i actually attempted to add zstd myself to zarr.jl this morning through a package extension directly using codecZstd. first started refactoring the existing blosc and zlib code to use extensions, but then realized that structs defined in extension modules cannot be exported. so the user would not have access to e.g. BloscCompressor. but i think a refactoring to use transcodingstreams.jl wouldn't even need an extension. i'm filing this issue because i want to get a zarr.jl dev's opinion before proceeding further.

Joining the Zarr Implementation council

This package (Zarr.jl) was invited to join the Zarr implementation council, which will be established in the future. If we accept, we have to nominate a representative who can join the meetings and speak for the implementation.

I would volunteer to represent Zarr.jl at the Implementation council, but of course wanted to check with other contributors on their opinions @visr @Alexander-Barth @Fbehr-data @mkitti

Reference to the zarr-governance issue: zarr-developers/governance#18

support time data types

When trying to open a .zarr file I encounter the following error.

using Zarr
file = zopen("filename.zarr") 

ArgumentError: <M8[D] is not a valid numpy typestr
Stacktrace:
  [1] typestr(s::String, filterlist::Nothing)
    @ Zarr ~/.julia/packages/Zarr/EjE1w/src/metadata.jl:55
  [2] Zarr.Metadata(d::Dict{String, Any})
    @ Zarr ~/.julia/packages/Zarr/EjE1w/src/metadata.jl:142
  [3] Zarr.Metadata(s::String)
    @ Zarr ~/.julia/packages/Zarr/EjE1w/src/metadata.jl:131
  [4] getmetadata(s::DirectoryStore)
    @ Zarr ~/.julia/packages/Zarr/EjE1w/src/Storage/Storage.jl:100
  [5] ZArray(s::DirectoryStore, mode::String)
    @ Zarr ~/.julia/packages/Zarr/EjE1w/src/ZArray.jl:95
  [6] zopen(s::DirectoryStore, mode::String; consolidated::Bool)
    @ Zarr ~/.julia/packages/Zarr/EjE1w/src/ZGroup.jl:62
  [7] zopen(s::DirectoryStore, mode::String)
    @ Zarr ~/.julia/packages/Zarr/EjE1w/src/ZGroup.jl:60
  [8] ZGroup(s::DirectoryStore, mode::String)
    @ Zarr ~/.julia/packages/Zarr/EjE1w/src/ZGroup.jl:18
  [9] zopen(s::DirectoryStore, mode::String; consolidated::Bool)
    @ Zarr ~/.julia/packages/Zarr/EjE1w/src/ZGroup.jl:64
 [10] zopen(s::DirectoryStore, mode::String)
    @ Zarr ~/.julia/packages/Zarr/EjE1w/src/ZGroup.jl:60
 [11] #zopen#75
    @ ~/.julia/packages/Zarr/EjE1w/src/ZGroup.jl:77 [inlined]
 [12] zopen (repeats 2 times)
    @ ~/.julia/packages/Zarr/EjE1w/src/ZGroup.jl:77 [inlined]
 [13] top-level scope
    @ In[19]:2
 [14] eval
    @ ./boot.jl:360 [inlined]
 [15] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094
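For context, `<M8[D]` is the NumPy typestr for a little-endian, 8-byte `datetime64` with day resolution. The sketch below shows one way such a string could be split into its parts and how a day count could be mapped to a Julia `Date`. Both `parse_typestr` and `decode_days` are hypothetical helpers written for illustration; they are not part of Zarr.jl and say nothing about how the package actually parses metadata.

```julia
using Dates

# Split a NumPy typestr such as "<M8[D]" into byte order, kind, item size and unit.
# Hypothetical helper, not Zarr.jl API.
function parse_typestr(s::AbstractString)
    m = match(r"^([<>|])([A-Za-z])(\d+)(?:\[(\w+)\])?$", s)
    m === nothing && throw(ArgumentError("$s is not a valid numpy typestr"))
    # `unit` is `nothing` for types without a bracketed unit (e.g. "<f8").
    return (byteorder = m[1], kind = m[2], itemsize = parse(Int, m[3]), unit = m[4])
end

# datetime64[D] stores a signed 64-bit count of days since 1970-01-01.
decode_days(n::Integer) = Date(1970, 1, 1) + Day(n)

t = parse_typestr("<M8[D]")
@show t.kind t.itemsize t.unit
@show decode_days(365)   # 1971-01-01
```

Other units (`s`, `ms`, `ns`, ...) would need their own mapping to `DateTime` or a nanosecond-capable type; that part is left out here.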

Zarr.jl

Hi Fabian,

Cool to see you are working on this. Didn't really go through the code yet, but will do later.

I had also made a small start, but basically the only part that was working was the metadata :)
Since I had so little and wasn't sure I'd go through with it, I didn't announce it anywhere. But here it is:
https://gitlab.com/visr/Zarr.jl

If you are up for it, we can join efforts and compare approaches.

I guess we can just call this Zarr.jl eventually, and rename Zarr.jl to ZarrPyCall or something like that?

MinIO test fails locally

I am trying to test Zarr locally and I get the following error in the MinIO S3 test.

Minio S3 storage: Error During Test at /home/fcremer/Documents/PyramidScheme/dev/Zarr/test/storage.jl:153
  Got exception outside of a @test
  AWS.AWSExceptions.AWSException: SlowDown -- Resource requested is unreadable, please reduce your request rate
  
  HTTP.Exceptions.StatusError(503, "PUT", "/zarrdata", HTTP.Messages.Response:
  """
  HTTP/1.1 503 Service Unavailable
  Accept-Ranges: bytes
  Content-Length: 314
  Content-Security-Policy: block-all-mixed-content
  Content-Type: application/xml
  Retry-After: 120
  Server: MinIO
  Strict-Transport-Security: max-age=31536000; includeSubDomains
  Vary: Origin, Accept-Encoding
  X-Amz-Request-Id: 17C8910F51C5F95B
  X-Content-Type-Options: nosniff
  X-Xss-Protection: 1; mode=block
  Date: Mon, 22 Apr 2024 09:45:53 GMT
  
  [Message Body was streamed]""")
  
  <?xml version="1.0" encoding="UTF-8"?>
  <Error><Code>SlowDown</Code><Message>Resource requested is unreadable, please reduce your request rate</Message><BucketName>zarrdata</BucketName><Resource>/zarrdata</Resource><RequestId>17C8910F51C5F95B</RequestId><HostId>16de69c3-d609-499d-ae6d-67c17b8d9196</HostId></Error>
  
  Stacktrace:
    [1] (::HTTP.ConnectionRequest.var"#connections#4"{HTTP.ConnectionRequest.var"#connections#1#5"{HTTP.TimeoutRequest.var"#timeouts#3"{HTTP.TimeoutRequest.var"#timeouts#1#4"{HTTP.ExceptionRequest.var"#exceptions#2"{HTTP.ExceptionRequest.var"#exceptions#1#3"{typeof(HTTP.StreamRequest.streamlayer)}}}}}})(req::HTTP.Messages.Request; proxy::Nothing, socket_type::Type, socket_type_tls::Type, readtimeout::Int64, connect_timeout::Int64, logerrors::Bool, logtag::Nothing, kw::@Kwargs{iofunction::Nothing, decompress::Nothing, verbose::Int64})
      @ HTTP.ConnectionRequest ~/.julia/packages/HTTP/vnQzp/src/clientlayers/ConnectionRequest.jl:141
    [2] (::HTTP.RetryRequest.var"#manageretries#3"{HTTP.RetryRequest.var"#manageretries#1#4"{HTTP.ConnectionRequest.var"#connections#4"{HTTP.ConnectionRequest.var"#connections#1#5"{HTTP.TimeoutRequest.var"#timeouts#3"{HTTP.TimeoutRequest.var"#timeouts#1#4"{HTTP.ExceptionRequest.var"#exceptions#2"{HTTP.ExceptionRequest.var"#exceptions#1#3"{typeof(HTTP.StreamRequest.streamlayer)}}}}}}}})(req::HTTP.Messages.Request; retry::Bool, retries::Int64, retry_delays::ExponentialBackOff, retry_check::Function, retry_non_idempotent::Bool, kw::@Kwargs{iofunction::Nothing, decompress::Nothing, verbose::Int64})
      @ HTTP.RetryRequest ~/.julia/packages/HTTP/vnQzp/src/clientlayers/RetryRequest.jl:35
    [3] manageretries
      @ ~/.julia/packages/HTTP/vnQzp/src/clientlayers/RetryRequest.jl:30 [inlined]
    [4] (::HTTP.CookieRequest.var"#managecookies#4"{HTTP.CookieRequest.var"#managecookies#1#5"{HTTP.RetryRequest.var"#manageretries#3"{HTTP.RetryRequest.var"#manageretries#1#4"{HTTP.ConnectionRequest.var"#connections#4"{HTTP.ConnectionRequest.var"#connections#1#5"{HTTP.TimeoutRequest.var"#timeouts#3"{HTTP.TimeoutRequest.var"#timeouts#1#4"{HTTP.ExceptionRequest.var"#exceptions#2"{HTTP.ExceptionRequest.var"#exceptions#1#3"{typeof(HTTP.StreamRequest.streamlayer)}}}}}}}}}})(req::HTTP.Messages.Request; cookies::Bool, cookiejar::HTTP.Cookies.CookieJar, kw::@Kwargs{iofunction::Nothing, decompress::Nothing, verbose::Int64, retry::Bool})
      @ HTTP.CookieRequest ~/.julia/packages/HTTP/vnQzp/src/clientlayers/CookieRequest.jl:42
    [5] managecookies
      @ ~/.julia/packages/HTTP/vnQzp/src/clientlayers/CookieRequest.jl:19 [inlined]
    [6] (::HTTP.HeadersRequest.var"#defaultheaders#2"{HTTP.HeadersRequest.var"#defaultheaders#1#3"{HTTP.CookieRequest.var"#managecookies#4"{HTTP.CookieRequest.var"#managecookies#1#5"{HTTP.RetryRequest.var"#manageretries#3"{HTTP.RetryRequest.var"#manageretries#1#4"{HTTP.ConnectionRequest.var"#connections#4"{HTTP.ConnectionRequest.var"#connections#1#5"{HTTP.TimeoutRequest.var"#timeouts#3"{HTTP.TimeoutRequest.var"#timeouts#1#4"{HTTP.ExceptionRequest.var"#exceptions#2"{HTTP.ExceptionRequest.var"#exceptions#1#3"{typeof(HTTP.StreamRequest.streamlayer)}}}}}}}}}}}})(req::HTTP.Messages.Request; iofunction::Nothing, decompress::Nothing, basicauth::Bool, detect_content_type::Bool, canonicalize_headers::Bool, kw::@Kwargs{verbose::Int64, retry::Bool})
      @ HTTP.HeadersRequest ~/.julia/packages/HTTP/vnQzp/src/clientlayers/HeadersRequest.jl:71
    [7] defaultheaders
      @ ~/.julia/packages/HTTP/vnQzp/src/clientlayers/HeadersRequest.jl:14 [inlined]
    [8] (::HTTP.RedirectRequest.var"#redirects#3"{HTTP.RedirectRequest.var"#redirects#1#4"{HTTP.HeadersRequest.var"#defaultheaders#2"{HTTP.HeadersRequest.var"#defaultheaders#1#3"{HTTP.CookieRequest.var"#managecookies#4"{HTTP.CookieRequest.var"#managecookies#1#5"{HTTP.RetryRequest.var"#manageretries#3"{HTTP.RetryRequest.var"#manageretries#1#4"{HTTP.ConnectionRequest.var"#connections#4"{HTTP.ConnectionRequest.var"#connections#1#5"{HTTP.TimeoutRequest.var"#timeouts#3"{HTTP.TimeoutRequest.var"#timeouts#1#4"{HTTP.ExceptionRequest.var"#exceptions#2"{HTTP.ExceptionRequest.var"#exceptions#1#3"{typeof(HTTP.StreamRequest.streamlayer)}}}}}}}}}}}}}})(req::HTTP.Messages.Request; redirect::Bool, redirect_limit::Int64, redirect_method::Nothing, forwardheaders::Bool, response_stream::Base.BufferStream, kw::@Kwargs{verbose::Int64, retry::Bool})
      @ HTTP.RedirectRequest ~/.julia/packages/HTTP/vnQzp/src/clientlayers/RedirectRequest.jl:17
    [9] redirects
      @ ~/.julia/packages/HTTP/vnQzp/src/clientlayers/RedirectRequest.jl:14 [inlined]
   [10] (::HTTP.MessageRequest.var"#makerequest#3"{HTTP.MessageRequest.var"#makerequest#1#4"{HTTP.RedirectRequest.var"#redirects#3"{HTTP.RedirectRequest.var"#redirects#1#4"{HTTP.HeadersRequest.var"#defaultheaders#2"{HTTP.HeadersRequest.var"#defaultheaders#1#3"{HTTP.CookieRequest.var"#managecookies#4"{HTTP.CookieRequest.var"#managecookies#1#5"{HTTP.RetryRequest.var"#manageretries#3"{HTTP.RetryRequest.var"#manageretries#1#4"{HTTP.ConnectionRequest.var"#connections#4"{HTTP.ConnectionRequest.var"#connections#1#5"{HTTP.TimeoutRequest.var"#timeouts#3"{HTTP.TimeoutRequest.var"#timeouts#1#4"{HTTP.ExceptionRequest.var"#exceptions#2"{HTTP.ExceptionRequest.var"#exceptions#1#3"{typeof(HTTP.StreamRequest.streamlayer)}}}}}}}}}}}}}}}})(method::String, url::URIs.URI, headers::Vector{Pair{SubString{String}, SubString{String}}}, body::String; copyheaders::Bool, response_stream::Base.BufferStream, http_version::HTTP.Strings.HTTPVersion, verbose::Int64, kw::@Kwargs{redirect::Bool, retry::Bool})
      @ HTTP.MessageRequest ~/.julia/packages/HTTP/vnQzp/src/clientlayers/MessageRequest.jl:35
   [11] makerequest
      @ ~/.julia/packages/HTTP/vnQzp/src/clientlayers/MessageRequest.jl:24 [inlined]
   [12] request(stack::HTTP.MessageRequest.var"#makerequest#3"{HTTP.MessageRequest.var"#makerequest#1#4"{HTTP.RedirectRequest.var"#redirects#3"{HTTP.RedirectRequest.var"#redirects#1#4"{HTTP.HeadersRequest.var"#defaultheaders#2"{HTTP.HeadersRequest.var"#defaultheaders#1#3"{HTTP.CookieRequest.var"#managecookies#4"{HTTP.CookieRequest.var"#managecookies#1#5"{HTTP.RetryRequest.var"#manageretries#3"{HTTP.RetryRequest.var"#manageretries#1#4"{HTTP.ConnectionRequest.var"#connections#4"{HTTP.ConnectionRequest.var"#connections#1#5"{HTTP.TimeoutRequest.var"#timeouts#3"{HTTP.TimeoutRequest.var"#timeouts#1#4"{HTTP.ExceptionRequest.var"#exceptions#2"{HTTP.ExceptionRequest.var"#exceptions#1#3"{typeof(HTTP.StreamRequest.streamlayer)}}}}}}}}}}}}}}}}, method::String, url::URIs.URI, h::Vector{Pair{SubString{String}, SubString{String}}}, b::String, q::Nothing; headers::Vector{Pair{SubString{String}, SubString{String}}}, body::String, query::Nothing, kw::@Kwargs{redirect::Bool, retry::Bool, response_stream::Base.BufferStream})
      @ HTTP ~/.julia/packages/HTTP/vnQzp/src/HTTP.jl:457
   [13] #request#20
      @ ~/.julia/packages/HTTP/vnQzp/src/HTTP.jl:315 [inlined]
   [14] macro expansion
      @ ~/.julia/packages/Mocking/Q17aB/src/mock.jl:29 [inlined]
   [15] (::AWS.var"#48#50"{AWS.Request, OrderedCollections.LittleDict{Symbol, Any, Vector{Symbol}, Vector{Any}}})()
      @ AWS ~/.julia/packages/AWS/Fxun1/src/utilities/request.jl:225
   [16] (::Base.var"#96#98"{Base.var"#96#97#99"{AWS.AWSExponentialBackoff, AWS.var"#49#51", AWS.var"#48#50"{AWS.Request, OrderedCollections.LittleDict{Symbol, Any, Vector{Symbol}, Vector{Any}}}}})(; kwargs::@Kwargs{})
      @ Base ./error.jl:308
   [17] (::Base.var"#96#98"{Base.var"#96#97#99"{AWS.AWSExponentialBackoff, AWS.var"#49#51", AWS.var"#48#50"{AWS.Request, OrderedCollections.LittleDict{Symbol, Any, Vector{Symbol}, Vector{Any}}}}})()
      @ Base ./error.jl:291
   [18] _http_request(http_backend::AWS.HTTPBackend, request::AWS.Request, response_stream::IOBuffer)
      @ AWS ~/.julia/packages/AWS/Fxun1/src/utilities/request.jl:250
   [19] macro expansion
      @ ~/.julia/packages/Mocking/Q17aB/src/mock.jl:29 [inlined]
   [20] (::AWS.var"#41#44"{MinioConfig, AWS.Request, IOBuffer, Vector{Int64}})()
      @ AWS ~/.julia/packages/AWS/Fxun1/src/utilities/request.jl:134
   [21] (::AWS.var"#42#46"{AWS.var"#41#44"{MinioConfig, AWS.Request, IOBuffer, Vector{Int64}}, IOBuffer})()
      @ AWS ~/.julia/packages/AWS/Fxun1/src/utilities/request.jl:149
   [22] (::Base.var"#96#98"{Base.var"#96#97#99"{AWS.AWSExponentialBackoff, AWS.var"#43#47"{MinioConfig, Vector{String}, Vector{String}, Int64}, AWS.var"#42#46"{AWS.var"#41#44"{MinioConfig, AWS.Request, IOBuffer, Vector{Int64}}, IOBuffer}}})(; kwargs::@Kwargs{})
      @ Base ./error.jl:308
   [23] (::Base.var"#96#98"{Base.var"#96#97#99"{AWS.AWSExponentialBackoff, AWS.var"#43#47"{MinioConfig, Vector{String}, Vector{String}, Int64}, AWS.var"#42#46"{AWS.var"#41#44"{MinioConfig, AWS.Request, IOBuffer, Vector{Int64}}, IOBuffer}}})()
      @ Base ./error.jl:291
   [24] submit_request(aws::MinioConfig, request::AWS.Request; return_headers::Nothing)
      @ AWS ~/.julia/packages/AWS/Fxun1/src/utilities/request.jl:200
   [25] (::AWS.RestXMLService)(request_method::String, request_uri::String, args::Dict{String, Any}; aws_config::MinioConfig, feature_set::AWS.FeatureSet)
      @ AWS ~/.julia/packages/AWS/Fxun1/src/AWS.jl:287
   [26] RestXMLService (repeats 2 times)
      @ ~/.julia/packages/AWS/Fxun1/src/AWS.jl:251 [inlined]
   [27] create_bucket(Bucket::String; aws_config::MinioConfig)
      @ AWSS3.S3 ~/.julia/packages/AWS/Fxun1/src/services/s3.jl:557
   [28] create_bucket(Bucket::String)
      @ AWSS3.S3 ~/.julia/packages/AWS/Fxun1/src/services/s3.jl:556
   [29] macro expansion
      @ ~/Documents/PyramidScheme/dev/Zarr/test/storage.jl:163 [inlined]
   [30] macro expansion
      @ ~/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [31] top-level scope
      @ ~/Documents/PyramidScheme/dev/Zarr/test/storage.jl:154
   [32] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [33] macro expansion
      @ ~/Documents/PyramidScheme/dev/Zarr/test/runtests.jl:260 [inlined]
   [34] macro expansion
      @ ~/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [35] top-level scope
      @ ~/Documents/PyramidScheme/dev/Zarr/test/runtests.jl:17
   [36] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [37] top-level scope
      @ none:6
   [38] eval
      @ ./boot.jl:385 [inlined]
   [39] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [40] _start()
      @ Base ./client.jl:552

Support bitshuffle in `BloscCompressor`

Currently, the shuffle field of BloscCompressor is a Bool, which means that compressing with bit shuffle is not possible.

struct BloscCompressor <: Compressor
    blocksize::Int
    clevel::Int
    cname::String
    shuffle::Bool
end

However, in numcodecs (https://numcodecs.readthedocs.io/en/stable/blosc.html), shuffle is an integer:

Either NOSHUFFLE (0), SHUFFLE (1), BITSHUFFLE (2) or AUTOSHUFFLE (-1). If AUTOSHUFFLE, bit-shuffle will be used for buffers with itemsize 1, and byte-shuffle will be used otherwise. The default is SHUFFLE.

I think the shuffle field type should be changed to Int and the logic in
https://github.com/zarr-developers/numcodecs/blob/7d8f9762b4f0f9b5e135688b2eeb3f783f90f208/numcodecs/blosc.pyx#L264-L272 implemented in this package.

I'm happy to make a PR if you think this is a good idea.
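A minimal sketch of the AUTOSHUFFLE rule quoted above, assuming the integer values from the numcodecs documentation. The function `resolve_shuffle` is a hypothetical name; it only shows the selection logic and does not call Blosc.

```julia
# Shuffle modes with the values quoted from the numcodecs documentation.
const NOSHUFFLE = 0
const SHUFFLE = 1
const BITSHUFFLE = 2
const AUTOSHUFFLE = -1

# Hypothetical helper: pick the concrete shuffle mode for a given element size.
# Accepting any Integer keeps old Bool values working (false == 0, true == 1).
function resolve_shuffle(shuffle::Integer, itemsize::Integer)
    shuffle in (NOSHUFFLE, SHUFFLE, BITSHUFFLE, AUTOSHUFFLE) ||
        throw(ArgumentError("invalid shuffle value $shuffle"))
    shuffle == AUTOSHUFFLE || return Int(shuffle)
    # AUTOSHUFFLE: bit-shuffle for 1-byte items, byte-shuffle otherwise.
    return itemsize == 1 ? BITSHUFFLE : SHUFFLE
end

@show resolve_shuffle(AUTOSHUFFLE, sizeof(UInt8))     # 2
@show resolve_shuffle(AUTOSHUFFLE, sizeof(Float64))   # 1
@show resolve_shuffle(true, sizeof(Float32))          # 1
```

Because `Bool <: Integer` in Julia, widening the field type this way would not break callers that still pass `true` or `false`.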

Zarr from gdalwarp cannot be read because fill_value is Nothing

I converted a tif file using gdalwarp -of Zarr path.tif path.zarr
This produces a .zarr file that I can open with Zarr.jl, but reading the data fails with the following error while filling the array with the fill value, which is Nothing. The tif file and the converted Zarr file are attached to the issue. I am not sure whether this is a Zarr.jl or a GDAL bug.

julia> z = zopen(path)
ZarrGroup at DirectoryStore("test/data/cea.zarr") and path 
Variables: Y X cea 

julia> z.arrays
Dict{String, ZArray} with 3 entries:
  "Y"   => ZArray{Float64} of size 515
  "X"   => ZArray{Float64} of size 514
  "cea" => ZArray{UInt8} of size 514 x 515

julia> using ZarrDatasets^C

julia> z["cea"][:,:]
ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type UInt8

Closest candidates are:
  convert(::Type{T}, ::EzXML.NodeType) where T<:Integer
   @ EzXML ~/.julia/packages/EzXML/DL8na/src/node.jl:36
  convert(::Type{T}, ::EzXML.ReaderType) where T<:Integer
   @ EzXML ~/.julia/packages/EzXML/DL8na/src/streamreader.jl:59
  convert(::Type{T}, ::Number) where T<:Number
   @ Base number.jl:7
  ...

Stacktrace:
 [1] fill!(dest::Matrix{UInt8}, x::Nothing)
   @ Base ./array.jl:393
 [2] uncompress_raw!
   @ ~/.julia/packages/Zarr/uTqS1/src/ZArray.jl:259 [inlined]
 [3] uncompress_to_output!(aout::Matrix{…}, output_base_offsets::Tuple{…}, z::ZArray{…}, chunk_compressed::Nothing, current_chunk_offsets::Tuple{…}, a::Matrix{…}, indranges::Tuple{…})
   @ Zarr ~/.julia/packages/Zarr/uTqS1/src/ZArray.jl:270
 [4] readblock!(aout::Matrix{…}, z::ZArray{…}, r::CartesianIndices{…})
   @ Zarr ~/.julia/packages/Zarr/uTqS1/src/ZArray.jl:178
 [5] readblock!(::ZArray{…}, ::Matrix{…}, ::Base.OneTo{…}, ::Vararg{…})
   @ Zarr ~/.julia/packages/Zarr/uTqS1/src/ZArray.jl:247
 [6] getindex_disk(::ZArray{UInt8, 2, Zarr.NoCompressor, DirectoryStore}, ::Function, ::Vararg{Function})
   @ DiskArrays ~/.julia/packages/DiskArrays/bZBJE/src/diskarray.jl:40
 [7] getindex(::ZArray{UInt8, 2, Zarr.NoCompressor, DirectoryStore}, ::Function, ::Function)
   @ DiskArrays ~/.julia/packages/DiskArrays/bZBJE/src/diskarray.jl:211
 [8] top-level scope
   @ REPL[24]:1
Some type information was truncated. Use `show(err)` to see complete types.
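The failing frame is `fill!(dest::Matrix{UInt8}, x::Nothing)`. One possible way around it is sketched below: fall back to `zero(T)` when no fill value is stored. This fallback is an assumption made for illustration, since the contents of such a chunk are formally undefined, and `fill_missing_chunk!` is a hypothetical name rather than a Zarr.jl function.

```julia
# Hypothetical helper mirroring the failing `fill!` call.
# Uses the stored fill value if there is one, otherwise zero(T) (an assumption).
function fill_missing_chunk!(a::AbstractArray{T}, fill_value) where {T}
    return fill!(a, something(fill_value, zero(T)))
end

a = Matrix{UInt8}(undef, 2, 3)

fill_missing_chunk!(a, nothing)   # no fill value stored: every element becomes 0x00
@show a

fill_missing_chunk!(a, 255)       # a stored fill value is converted to the element type
@show a
```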

Storage not empty error with DictStore

When writing to a hierarchical group using a DictStore, if I create an array with some name and then create an array with a name that the previous array starts with, an error is raised. Here's a MWE:

julia> using Zarr

julia> d = Zarr.DictStore()
Dictionary Storage

julia> g = zgroup(d)
ZarrGroup at Dictionary Storage and path 

julia> g2 = zgroup(g, "group")
ZarrGroup at Dictionary Storage and path group

julia> zcreate(Float64, g2, "foo_bar", 5)
ZArray{Float64} of size 5

julia> zcreate(Float64, g2, "foo", 5)
ERROR: Dictionary Storage group/foo is not empty
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] zcreate(::Type{Float64}, storage::Zarr.DictStore, dims::Int64; path::String, chunks::Tuple{Int64}, fill_value::Nothing, fill_as_missing::Bool, compressor::Zarr.BloscCompressor, filters::Nothing, attrs::Dict{Any, Any}, writeable::Bool)
   @ Zarr ~/.julia/packages/Zarr/T6jFD/src/ZArray.jl:270
 [3] zcreate(::Type{Float64}, ::ZGroup{Zarr.DictStore}, ::String, ::Int64, ::Vararg{Int64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Zarr ~/.julia/packages/Zarr/T6jFD/src/ZGroup.jl:151
 [4] zcreate(::Type{Float64}, ::ZGroup{Zarr.DictStore}, ::String, ::Int64, ::Vararg{Int64})
   @ Zarr ~/.julia/packages/Zarr/T6jFD/src/ZGroup.jl:148
 [5] top-level scope
   @ REPL[45]:1

julia> zcreate(Float64, g2, "baz", 5)
ZArray{Float64} of size 5

julia> zcreate(Float64, g2, "baz_baz", 5)  # writing the substring first is fine
ZArray{Float64} of size 5
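The order dependence would be consistent with an emptiness check that compares raw key prefixes without a path separator. Whether that is the actual cause in DictStore has not been confirmed here; the sketch below only reproduces the symptom on a plain vector of keys, with hypothetical function names.

```julia
# Keys as a dictionary-backed store might hold them after creating "group/foo_bar".
store_keys = ["group/.zgroup", "group/foo_bar/.zarray", "group/foo_bar/0"]

# Naive check: any key that starts with the path counts as lying inside it.
naive_isempty(ks, p) = !any(k -> startswith(k, p), ks)

# Separator-aware check: "group/foo" must not match "group/foo_bar/...".
function sep_isempty(ks, p)
    prefix = endswith(p, "/") ? p : p * "/"
    return !any(k -> startswith(k, prefix), ks)
end

@show naive_isempty(store_keys, "group/foo")   # false: the reported error
@show sep_isempty(store_keys, "group/foo")     # true
@show sep_isempty(store_keys, "group/foo_bar") # false: really occupied
```

The same reasoning explains why creating "baz" before "baz_baz" works: no existing key starts with "group/baz_baz".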

Switch S3Store from AWS.jl to AWSS3.jl

AWSS3.jl provides convenience functions for filesystem-like operations on S3 stores. It would be good to rewrite the S3Store to use AWSS3 instead of AWS directly.

Set/get index cases not working, possibly depending on chunks

I encounter strange errors where Zarr refuses to write and read data for certain chunk combinations/dimensions.

ztest = zcreate(Float32, (4,4,4)..., path="test.zarr", chunks=(2,2,2))

ztest[2,:,1:2]= zeros(Int64, 1, 4,2) # this works

@show ztest[2,:,1:2] # this works

@show ztest[:,:,:] # this fails

@show ztest[1,:,:] # this also fails, with the error:

MethodError: Cannot `convert` an object of type Nothing to an object of type Float32
Closest candidates are:
  convert(::Type{T}, !Matched::Ratios.SimpleRatio{S}) where {T<:AbstractFloat, S} at C:\Users\vilim\.julia\packages\Ratios\uRs4y\src\Ratios.jl:14
  convert(::Type{T}, !Matched::AbstractNumbers.AbstractNumber) where T<:Number at C:\Users\vilim\.julia\packages\AbstractNumbers\fUFfp\src\AbstractNumbers.jl:22
  convert(::Type{T}, !Matched::T) where T<:Number at number.jl:6
  ...

Stacktrace:
 [1] fill!(::Array{Float32,3}, ::Nothing) at .\array.jl:336
 [2] readchunk!(::Array{Float32,3}, ::ZArray{Float32,3,Zarr.BloscCompressor,DirectoryStore}, ::CartesianIndex{3}) at C:\Users\vilim\.julia\packages\Zarr\It8wz\src\ZArray.jl:224
 [3] (::Zarr.var"#31#34"{Bool,ZArray{Float32,3,Zarr.BloscCompressor,DirectoryStore},CartesianIndices{3,Tuple{UnitRange{Int64},Base.OneTo{Int64},Base.OneTo{Int64}}},Array{Float32,3}})(::CartesianIndex{3}) at C:\Users\vilim\.julia\packages\Zarr\It8wz\src\ZArray.jl:155
 [4] foreach at .\abstractarray.jl:1919 [inlined]
 [5] readblock!(::Array{Float32,3}, ::ZArray{Float32,3,Zarr.BloscCompressor,DirectoryStore}, ::CartesianIndices{3,Tuple{UnitRange{Int64},Base.OneTo{Int64},Base.OneTo{Int64}}}; readmode::Bool) at C:\Users\vilim\.julia\packages\Zarr\It8wz\src\ZArray.jl:145
 [6] readblock!(::Array{Float32,3}, ::ZArray{Float32,3,Zarr.BloscCompressor,DirectoryStore}, ::CartesianIndices{3,Tuple{UnitRange{Int64},Base.OneTo{Int64},Base.OneTo{Int64}}}) at C:\Users\vilim\.julia\packages\Zarr\It8wz\src\ZArray.jl:139
 [7] getindex(::ZArray{Float32,3,Zarr.BloscCompressor,DirectoryStore}, ::Int64, ::Function, ::Function) at C:\Users\vilim\.julia\packages\Zarr\It8wz\src\ZArray.jl:187
 [8] top-level scope at In[165]:1

This is with the current version of Zarr, 0.4.4, on Julia 1.4.2 on Windows

Very slow S3 read relative to xarray

Thanks a ton for the great package... very happy to see support for Zarr being added to Julia. Zarr is taking off so I suspect this package is going to get a lot of use. Working with Zarr.jl I've noticed very slow read times relative to what I can achieve with xarray (with zarr engine) in Python.

Reading a column of data in Julia takes 1.7 min vs. 5 sec using xarray in Python... I really want to start moving to Julia and am very keen to see the tools improve.

------------------ Julia Zarr Read ------------------

using Zarr, AWS

# set up aws configuration
AWS.global_aws_config(AWSConfig(creds=nothing, region = "us-west-2"))

# path to datacube
path = "s3://its-live-data.jpl.nasa.gov/datacubes/v1/N60W040/ITS_LIVE_vel_EPSG3413_G0120_X-150000_Y-2250000.zarr"

# map into datacube
z = zopen(path, consolidated=true)

# map to specific variables
v = z["v"]

# read data
foo = v[1,1,:] #takes ~1.7 min

%-------------- Python Zarr read ------------------

import s3fs 
import xarray as xr

output_dir = "s3://its-live-data.jpl.nasa.gov/datacubes/v1/N60W040/ITS_LIVE_vel_EPSG3413_G0120_X-150000_Y-2250000.zarr"

s3_in = s3fs.S3FileSystem(anon=True)

cube_store = s3fs.S3Map(root=output_dir, s3=s3_in, check=False)
with xr.open_dataset(cube_store, decode_timedelta=False, engine='zarr', consolidated=True) as ds:
    grid_cell_v = ds.v.isel(x=1, y=1)
    grid_cell_v.mean()

Zarr group API is exported but not shown in the documentation

For example, the docstring for zgroup() is not rendered in the docs:

Zarr.jl/src/ZGroup.jl

Lines 124 to 129 in e5c7240

"""
zgroup(s::AbstractStore; attrs=Dict())
Create a new zgroup in the store `s`
"""
function zgroup(s::AbstractStore, path::String=""; attrs=Dict())

Also there might be some separate issues with the documentation building workflow where the stable versions after 0.3.2 are not built individually.

`mode=r+`: Cannot write to read-only ZArray

This fails:

p = "/tmp/example.zarr"
z1 = zcreate(Int, 10000, 10000, path = p, chunks = (1000, 1000))
z1[1, :] = 1:10000
mode = "r+"
z2 = zopen(p,mode)
z2[1,2] = 42
z1[1,1:3]

This also fails for mode="a" or mode="a+", but works on mode="w". Unfortunately, I cannot use mode="w" as I do not wish to truncate the file, but rather re-open an existing Zarr array for modification. On testing, it appears that mode="w" does not actually truncate the .zarr directory, which I believe conflicts with julia's definition for open modes / is inconsistent with python Zarr.
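For reference, Python zarr documents five persistence modes: "r" (read only, must exist), "r+" (read/write, must exist), "a" (read/write, create if missing), "w" (create, overwrite if it exists) and "w-" (create, fail if it exists). The table below is a sketch of how those semantics could be encoded; `MODE_FLAGS` and `mode_flags` are hypothetical names and do not describe what Zarr.jl currently does.

```julia
# Persistence modes as documented for Python zarr's `open`.
# Hypothetical lookup table, not Zarr.jl API.
const MODE_FLAGS = Dict(
    "r"  => (writeable = false, must_exist = true,  overwrite = false),
    "r+" => (writeable = true,  must_exist = true,  overwrite = false),
    "a"  => (writeable = true,  must_exist = false, overwrite = false),
    "w"  => (writeable = true,  must_exist = false, overwrite = true),
    "w-" => (writeable = true,  must_exist = false, overwrite = false),  # fails if the target exists
)

function mode_flags(mode::AbstractString)
    haskey(MODE_FLAGS, mode) || throw(ArgumentError("unknown mode \"$mode\""))
    return MODE_FLAGS[mode]
end

@show mode_flags("r+")   # writeable, must exist, never truncates
@show mode_flags("w")    # writeable and overwrites existing data
```

Under this table "a+" is not a valid mode, and "r+" is the one that matches the use case described above: reopening an existing array for modification without truncating it.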

Much appreciate your efforts to bring Zarr to Julia!

missing chunks to be filled with fill values (when server returns HTTP error 403)

When I try to load the following dataset with Zarr.jl, I unfortunately get an error:

using Zarr

ds = Zarr.zopen("https://s3.waw3-1.cloudferro.com/mdl-arco-time/arco/MEDSEA_MULTIYEAR_PHY_006_004/med-cmcc-cur-rean-d_202012/timeChunked.zarr")

ds["uo"][:,:,1,1]
# full error below

Yet, the data can be read with python zarr

import zarr

z = zarr.open("https://s3.waw3-1.cloudferro.com/mdl-arco-time/arco/MEDSEA_MULTIYEAR_PHY_006_004/med-cmcc-cur-rean-d_202012/timeChunked.zarr");

gz = z["uo"]
data = gz[0,0,:,:];

Note that all the data is filled with fill value (1e20) for this chunk. According to the OGC spec, it seems to be ok that not all chunks are present:

There is no need for all chunks to be present within an array store. If a chunk is not present then
it is considered to be in an uninitialized state. An unitialized chunk MUST be treated as if it was
uniformly filled with the value of the “fill_value” field in the array metadata. If the “fill_value” field
is null then the contents of the chunk are undefined.

Can Zarr.jl handle this case too? Are you accepting a PR for this issue?
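A rough sketch of the requested behaviour, under two assumptions that are not taken from Zarr.jl: that the store can see the HTTP status of a chunk request, and that an AccessDenied (403) answer may stand for an absent key, as S3-style endpoints can return when listing is not permitted. The names `chunk_bytes` and `read_chunk` are hypothetical, and the chunk is treated as uncompressed to keep the example self-contained.

```julia
# Hypothetical store lookup: map an HTTP status and body to chunk bytes or `nothing`.
# Treating 403 like 404 is an assumption, made configurable through `missing_statuses`.
function chunk_bytes(status::Integer, body::Vector{UInt8}; missing_statuses = (403, 404))
    status == 200 && return body
    status in missing_statuses && return nothing
    error("unexpected HTTP status $status")
end

# An absent chunk is materialised from the fill value, as the spec text above requires.
function read_chunk(::Type{T}, dims::Tuple, status, body, fill_value) where {T}
    bytes = chunk_bytes(status, body)
    bytes === nothing && return fill(T(fill_value), dims)
    return reshape(collect(reinterpret(T, bytes)), dims)
end

# Missing chunk: every element is the fill value (1e20 for this dataset).
@show read_chunk(Float32, (2, 2), 403, UInt8[], 1e20)

# Present chunk: the raw bytes are reinterpreted in column-major order.
raw = collect(reinterpret(UInt8, Float32[1, 2, 3, 4]))
@show read_chunk(Float32, (2, 2), 200, raw, 1e20)
```

Making the set of "missing" statuses configurable keeps a genuine permission problem distinguishable from an uninitialized chunk on servers where 403 really does mean access is denied.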

I am using Zarr v0.9.1.

Thanks for this great package, by the way :-)

Full error from Zarr.jl:

ERROR: TaskFailedException
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:920
  [2] wait()
    @ Base ./task.jl:984
  [3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
    @ Base ./condition.jl:130
  [4] wait
    @ ./condition.jl:125 [inlined]
  [5] take_buffered(c::Channel{Pair{CartesianIndex{4}, Union{Nothing, Vector{UInt8}}}})
    @ Base ./channels.jl:456
  [6] take!
    @ ./channels.jl:450 [inlined]
  [7] readblock!(aout::Array{Float32, 4}, z::ZArray{Float32, 4, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, r::CartesianIndices{4, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, UnitRange{Int64}, UnitRange{Int64}}})   
    @ Zarr ~/.julia/dev/Zarr/src/ZArray.jl:172
  [8] readblock!(::ZArray{Float32, 4, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, ::Array{Float32, 4}, ::Base.OneTo{Int64}, ::Vararg{AbstractUnitRange})
    @ Zarr ~/.julia/dev/Zarr/src/ZArray.jl:247
  [9] getindex_disk(::ZArray{Float32, 4, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, ::Function, ::Vararg{Any})
    @ DiskArrays ~/.julia/dev/DiskArrays/src/diskarray.jl:44
 [10] getindex(::ZArray{Float32, 4, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, ::Function, ::Function, ::Int64, ::Int64)
    @ DiskArrays ~/.julia/dev/DiskArrays/src/diskarray.jl:215
 [11] top-level scope
    @ REPL[229]:1
 [12] top-level scope
    @ ~/.julia/packages/CUDA/35NC6/src/initialization.jl:190

    nested task error: Error connecting to https://s3.waw3-1.cloudferro.com/mdl-arco-time/arco/MEDSEA_MULTIYEAR_PHY_006_004/med-cmcc-cur-rean-d_202012/timeChunked.zarr :<?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code><BucketName>mdl-arco-time</BucketName><RequestId>tx0000000000000003c6076-00656d923c-9f064368-default</RequestId><HostId>9f064368-default-waw3-1</HostId></Error>
    Stacktrace:
     [1] error(::String, ::String)
       @ Base ./error.jl:44
     [2] getindex(s::Zarr.HTTPStore, k::String)
       @ Zarr ~/.julia/dev/Zarr/src/Storage/http.jl:24
     [3] getindex
       @ ~/.julia/dev/Zarr/src/Storage/consolidated.jl:27 [inlined]
     [4] getindex
       @ ~/.julia/dev/Zarr/src/Storage/Storage.jl:55 [inlined]
     [5] getindex
       @ ~/.julia/dev/Zarr/src/Storage/Storage.jl:54 [inlined]
     [6] (::Zarr.var"#10#11"{Zarr.ConsolidatedStore{Zarr.HTTPStore}, Channel{Pair{CartesianIndex{4}, Union{Nothing, Vector{UInt8}}}}, String})(ii::CartesianIndex{4})
       @ Zarr ~/.julia/dev/Zarr/src/Storage/Storage.jl:121
     [7] (::Base.var"#978#983"{Zarr.var"#10#11"{Zarr.ConsolidatedStore{Zarr.HTTPStore}, Channel{Pair{CartesianIndex{4}, Union{Nothing, Vector{UInt8}}}}, String}})(r::Base.RefValue{Any}, args::Tuple{CartesianIndex{4}})                                     
       @ Base ./asyncmap.jl:100
     [8] macro expansion
       @ ./asyncmap.jl:234 [inlined]
     [9] (::Base.var"#994#995"{Base.var"#978#983"{Zarr.var"#10#11"{Zarr.ConsolidatedStore{Zarr.HTTPStore}, Channel{Pair{CartesianIndex{4}, Union{Nothing, Vector{UInt8}}}}, String}}, Channel{Any}, Nothing})()
       @ Base ./task.jl:514
    Stacktrace:
      [1] (::Base.var"#988#990")(x::Task)
        @ Base ./asyncmap.jl:177
      [2] foreach(f::Base.var"#988#990", itr::Vector{Any})
        @ Base ./abstractarray.jl:3073
      [3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::CartesianIndices{4, NTuple{4, UnitRange{Int64}}})
        @ Base ./asyncmap.jl:177
      [4] wrap_n_exec_twice
        @ ./asyncmap.jl:153 [inlined]
      [5] async_usemap(f::Zarr.var"#10#11"{Zarr.ConsolidatedStore{Zarr.HTTPStore}, Channel{Pair{CartesianIndex{4}, Union{Nothing, Vector{UInt8}}}}, String}, c::CartesianIndices{4, NTuple{4, UnitRange{Int64}}}; ntasks::Int64, batch_size::Nothing)
        @ Base ./asyncmap.jl:103
      [6] async_usemap
        @ ./asyncmap.jl:84 [inlined]
      [7] #asyncmap#972
        @ ./asyncmap.jl:81 [inlined]
      [8] asyncmap
        @ ./asyncmap.jl:80 [inlined]
      [9] read_items!
        @ ~/.julia/dev/Zarr/src/Storage/Storage.jl:119 [inlined]
     [10] read_items!
        @ ~/.julia/dev/Zarr/src/Storage/Storage.jl:109 [inlined]
     [11] macro expansion
        @ ~/.julia/dev/Zarr/src/ZArray.jl:165 [inlined]
     [12] (::Zarr.var"#63#66"{Channel{Pair{CartesianIndex{4}, Union{Nothing, Vector{UInt8}}}}, ZArray{Float32, 4, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, ZArray{Float32, 4, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, CartesianIndices{4, NTuple{4, UnitRange{Int64}}}})()
        @ Zarr ./task.jl:514

Registering the package and package name

Hi @visr

I would like to move forward and make a first release of the package as it is right now. I have added a lot of unit tests and tried to improve on the documentation. Unfortunately the S3 tests are still broken until AWSCore tags a new release, but since the object store is not yet documented anyway that should not be a problem.

The main reason that I want to register is that I need this as a dependency for another unregistered package. Would you have any objection against this?

I think there were two issues we did not solve when we started: the package name and where it is hosted. Do you think we should move this to JuliaGeo or somewhere else before registering (the format is actually not Geo-specific), and should we rename the package to Zarr.jl instead of ZarrNative?

setindex! with CartesianIndices fails

Assigning through a CartesianIndices{2} index with broadcasting fails with the following error message, although reading with the same index works:

ArgumentError: number of indices (4) must match the parent dimensionality (2)

using Zarr
using DiskArrays
z = zcreate(Int, 10000,10000,path = "data/example.zarr",chunks=(1000, 1000))
z[:] .= 42
I = first(DiskArrays.eachchunk(z))  # -> 1000×1000 CartesianIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}}
z[I]  # -> 1000×1000 Array{Int64,2} (expected)
z[I] .= 2
ERROR: ArgumentError: number of indices (4) must match the parent dimensionality (2)
Stacktrace:
 [1] check_parent_index_match(::ZArray{Int64,2,Zarr.BloscCompressor,DirectoryStore}, ::NTuple{4,Bool}) at .\subarray.jl:43
 [2] check_parent_index_match(::ZArray{Int64,2,Zarr.BloscCompressor,DirectoryStore}, ::Tuple{CartesianIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}},CartesianIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}}}) at .\subarray.jl:41
 [3] SubArray at .\subarray.jl:21 [inlined]
 [4] SubArray at .\subarray.jl:32 [inlined]
 [5] SubArray(::ZArray{Int64,2,Zarr.BloscCompressor,DirectoryStore}, ::Tuple{CartesianIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}},CartesianIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}}}) at .\subarray.jl:28
 [6] view(::ZArray{Int64,2,Zarr.BloscCompressor,DirectoryStore}, ::CartesianIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}}) at C:\Users\visser_mn\.julia\packages\DiskArrays\VITIC\src\subarrays.jl:8
 [7] maybeview at .\views.jl:133 [inlined]
 [8] dotview(::ZArray{Int64,2,Zarr.BloscCompressor,DirectoryStore}, ::CartesianIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}}) at .\broadcast.jl:1160
 [9] top-level scope at REPL[3]:1
 [10] include_string(::Function, ::Module, ::String, ::String) at .\loading.jl:1088
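
A possible workaround until this is fixed is to index with the underlying ranges instead of the CartesianIndices object, since `I.indices` is a tuple of per-dimension ranges. The sketch below uses a plain `Array` as a stand-in for the `ZArray`. Whether the range-based form avoids the error for a `ZArray` is an assumption that has not been verified here.

```julia
# Plain Array used as a stand-in for the ZArray in the report above.
A = fill(42, 6, 6)

# A block of indices like the one returned by DiskArrays.eachchunk.
I = CartesianIndices((1:3, 4:6))

# I.indices is the tuple of per-dimension ranges.
ranges = I.indices

# Index with the splatted ranges instead of the CartesianIndices object.
A[ranges...] .= 2
```

With a `ZArray` the corresponding call would be `z[I.indices...] .= 2`.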

Google Cloud storage

I am using your package on the pangeo cloud (on google cloud) with the STAC package (https://github.com/JuliaClimate/STAC.jl). In order to use the Google's storage REST API, I needed to implement a couple of methods and a custom storage type (https://github.com/JuliaClimate/STAC.jl/blob/main/examples/Julia-pangeo-STAC-Zarr.ipynb). Thanks to the nice API of Zarr.jl this can be done without changing anything in Zarr.jl but I am wondering if it would not be better to integrate this in your package. I would be happy to create a PR.

Initially, I wanted to use GoogleCloud.jl, but there are some open PRs that need to be merged first (e.g.
JuliaCloud/GoogleCloud.jl#32 )

Benchmark vs ZFP

zfp is a library by LLNL that also does multidimensional array compression.

It would be nice to see benchmarks against their vectorized (SIMD) algorithms.

You can easily use their library via ]add zfp_jll on recent Julia versions and call the shared library methods.

cc @aviks, who helped setup the zfp_jll wrapper and might have some usage examples lying around.

Zarr.is_zarray with empty string fails on Julia nightly

I just saw that while looking at the CI results for the doc deployment and it only fails on Julia nightly.

Test threw exception
  Expression: !(Zarr.is_zgroup(ds, ""))
  ArgumentError: embedded NULs are not allowed in C strings: "\0?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<ListBucketResult xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\"><Name>zarrdata</Name><Prefix>.zgroup</Prefix><KeyCount>0</KeyCount><MaxKeys>1</MaxKeys><Delimiter></Delimiter><IsTruncated>false</IsTruncated></ListBucketResult>"
  Stacktrace:
    [1] unsafe_convert
      @ ./strings/cstring.jl:85 [inlined]
    [2] macro expansion
      @ ~/.julia/packages/EzXML/DL8na/src/error.jl:50 [inlined]
    [3] parsexml(xmlstring::String; options::@Kwargs{})
      @ EzXML ~/.julia/packages/EzXML/DL8na/src/document.jl:114
    [4] parsexml
      @ ~/.julia/packages/EzXML/DL8na/src/document.jl:109 [inlined]
    [5] parse_xml
      @ ~/.julia/packages/XMLDict/vlQGP/src/XMLDict.jl:60 [inlined]
    [6] _read(io::IOBuffer, ::MIME{Symbol("application/xml")})
      @ AWS ~/.julia/packages/AWS/Fxun1/src/utilities/response.jl:61
    [7] #54
      @ ~/.julia/packages/AWS/Fxun1/src/utilities/response.jl:53 [inlined]
    [8] _rewind(f::AWS.var"#54#55"{typeof(AWS._read), MIME{Symbol("application/xml")}}, io::IOBuffer)
      @ AWS ~/.julia/packages/AWS/Fxun1/src/utilities/response.jl:120
    [9] parse
      @ ~/.julia/packages/AWS/Fxun1/src/utilities/response.jl:52 [inlined]
   [10] parse(r::AWS.Response{IOBuffer}, mime::MIME{Symbol("application/xml")})
      @ AWS ~/.julia/packages/AWS/Fxun1/src/utilities/response.jl:58
   [11] parse(args::AWS.Response{IOBuffer})
      @ AWSS3 ~/.julia/packages/AWSS3/A97Rh/src/AWSS3.jl:85
   [12] _s3_exists_file(aws::MinioConfig, bucket::String, path::String)
      @ AWSS3 ~/.julia/packages/AWSS3/A97Rh/src/AWSS3.jl:263
   [13] s3_exists_unversioned
      @ ~/.julia/packages/AWSS3/A97Rh/src/AWSS3.jl:358 [inlined]
   [14] #s3_exists#13
      @ ~/.julia/packages/AWSS3/A97Rh/src/AWSS3.jl:388 [inlined]
   [15] s3_exists(aws::MinioConfig, bucket::String, path::String)
      @ AWSS3 ~/.julia/packages/AWSS3/A97Rh/src/AWSS3.jl:384
   [16] isinitialized(s::S3Store, i::String)
      @ Zarr ~/work/Zarr.jl/Zarr.jl/src/Storage/s3store.jl:52
   [17] is_zgroup
      @ ~/work/Zarr.jl/Zarr.jl/src/Storage/Storage.jl:82 [inlined]
   [18] macro expansion
      @ ~/work/Zarr.jl/Zarr.jl/test/storage.jl:15 [inlined]
   [19] macro expansion
      @ /opt/hostedtoolcache/julia/nightly/x64/share/julia/stdlib/v1.12/Test/src/Test.jl:676 [inlined]
   [20] test_store_common(ds::S3Store)
      @ Main ~/work/Zarr.jl/Zarr.jl/test/storage.jl:15

size is not type stable

julia> zcube = zcreate(Float64, 3, 3, 3)
ZArray{Float64} of size 3 x 3 x 3

julia> @code_warntype size(zcube)
MethodInstance for size(::ZArray{Float64, 3, Zarr.BloscCompressor, Zarr.DictStore})
  from size(z::ZArray) in Zarr at /home/gkraemer/.julia/packages/Zarr/T6jFD/src/ZArray.jl:36
Arguments
  #self#::Core.Const(size)
  z::ZArray{Float64, 3, Zarr.BloscCompressor, Zarr.DictStore}
Body::Any
1 ─ %1 = Base.getproperty(z, :metadata)::Zarr.Metadata{Float64, 3, Zarr.BloscCompressor}
│   %2 = Base.getproperty(%1, :shape)::Ref{Tuple{Int64, Int64, Int64}}
│   %3 = Base.getindex(%2)::Any
└──      return %3

see JuliaDataCubes/YAXArrays.jl#148
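
The `@code_warntype` output above points at the `shape` field: it is declared as `Ref{...}`, which is an abstract type, so `getindex` on it is inferred as `Any`. A minimal sketch of one possible fix, using hypothetical stand-in structs rather than the real `Zarr.Metadata`, is to declare the field with the concrete `Base.RefValue` type:

```julia
# Hypothetical stand-ins for Zarr.Metadata, reduced to the shape field.
struct AbstractRefMeta{N}
    shape::Ref{NTuple{N,Int}}            # Ref is abstract, so the field type is not concrete
end

struct ConcreteRefMeta{N}
    shape::Base.RefValue{NTuple{N,Int}}  # RefValue is the concrete type that Ref(x) returns
end

# Same accessor for both variants, mirroring size(z) = z.metadata.shape[].
getshape(m) = m.shape[]

a = AbstractRefMeta{3}(Ref((3, 3, 3)))
c = ConcreteRefMeta{3}(Ref((3, 3, 3)))

# Return type that the compiler infers for the concrete variant.
inferred = only(Base.return_types(getshape, (ConcreteRefMeta{3},)))
```

Both variants return the same value at run time. Only the second one lets the compiler infer a concrete tuple type.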

Parallel reads are slow

Hello, thanks for this package! I'm exploring switching to Zarr.jl from HDF5.jl, but I'm having trouble with performance, so I wanted to check whether I'm missing something or whether this should be a feature request. At the moment, multithreading seems to be slower than single-threading in Zarr.jl, and HDF5.jl is ~3x faster for multiprocessing.

using Zarr, HDF5, BenchmarkTools, Distributed
import Base.Threads: @threads

addprocs(36)
@everywhere using HDF5, Zarr

path = joinpath(mktempdir(), "test.zarr")
M, N = 1000, 10000
z1 = zcreate(Float32, M,N,path = path,chunks=(M÷10, N÷10))  # integer division keeps the chunk sizes as Int
m1 = rand(Float32, M,N);
z1[:,:] .= m1
h5_path = joinpath(mktempdir(), "test.h5")
h5 = h5open(h5_path, "w")
h5["a"] = m1;
close(h5)
h5 = h5open(h5_path, "r"; swmr=true)

function threads_sum(arr)
    result = zeros(size(arr,1))

    @threads for i = 1:size(arr,1)
        result[i] = sum(arr[i,:])
    end
    result
end

@everywhere function distributed_sum(arr::HDF5.File; dset="a")
    @distributed vcat for i = 1:size(arr[dset],1)
        h5 = h5open(arr.filename)
        array = h5[dset]
        s = sum(array[i,:])
        close(h5)
        s  # return the row sum; otherwise the loop body returns the result of close(h5)
    end
end

@everywhere function distributed_sum(arr::ZArray)
    @distributed vcat for i = 1:size(arr,1)
        sum(arr[i,:])
    end
end

function seq_sum(arr)
    result = zeros(size(arr,1))

    for i = 1:size(arr,1)
        result[i] = sum(arr[i,:])
    end
    result
end
##
@btime seq_sum(m1); # 44ms
@btime threads_sum(m1); # 3ms
##
@btime seq_sum(h5["a"]); # 4.37s
@btime seq_sum(z1); # 2.08s
@btime threads_sum(z1); # 3.36s
@btime distributed_sum(h5) # 194ms
@btime distributed_sum(z1) # 590ms
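
One likely contributor to these numbers, offered as an assumption rather than a diagnosis, is the access pattern: each row slice `arr[i,:]` crosses every chunk along the second dimension, so a loop over all rows reads each chunk many times. The sketch below only counts chunk reads for the array and chunk sizes used in the benchmark. It assumes that no chunk cache sits between the loop and the store.

```julia
M, N = 1000, 10000
chunks = (M ÷ 10, N ÷ 10)          # (100, 1000), as in the benchmark

# Number of chunks in the whole array.
total_chunks = cld(M, chunks[1]) * cld(N, chunks[2])

# Each row slice arr[i, :] crosses every chunk along the second dimension.
chunks_per_row = cld(N, chunks[2])

# Chunk reads issued by a loop over all M rows, assuming no chunk cache.
rowwise_reads = M * chunks_per_row

# How many times each chunk is read on average.
reads_per_chunk = rowwise_reads ÷ total_chunks
```

Under that assumption, iterating over blocks that line up with the chunks (for example 100 rows at a time) would read each chunk once instead of 100 times.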

zopen creates unwanted empty folders in misspelled path

When I misspell the path to a zarr file, zopen creates the misspelled path on disk.
I think it would be better to throw an error before creating folders that are surely unwanted.
I suspect this comes from the call to mkpath in the DirectoryStore constructor.

julia> using Zarr
julia> zopen("zarrpath")
ERROR: ArgumentError: Specified store (zarrpath) is neither a ZArray nor a ZGroup
Stacktrace:
 [1] zopen(s::DirectoryStore, mode::String; consolidated::Bool)
   @ Zarr ~/.julia/packages/Zarr/EjE1w/src/ZGroup.jl:67
 [2] zopen(s::DirectoryStore, mode::String)
   @ Zarr ~/.julia/packages/Zarr/EjE1w/src/ZGroup.jl:60
 [3] #zopen#75
   @ ~/.julia/packages/Zarr/EjE1w/src/ZGroup.jl:77 [inlined]
 [4] zopen (repeats 2 times)
   @ ~/.julia/packages/Zarr/EjE1w/src/ZGroup.jl:77 [inlined]
 [5] top-level scope
   @ REPL[40]:1

shell> ls
cubes  ls  somepath  zarrpath
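
Until the constructor is changed, a caller can guard against this by checking the path first. The helper below is hypothetical and not part of Zarr.jl. It is a minimal sketch that only checks that the directory exists before anything gets a chance to create it.

```julia
# Hypothetical helper, not part of Zarr.jl: refuse to open a store whose
# directory does not exist, so that no directory is created as a side effect.
function require_existing_store(path::AbstractString)
    isdir(path) || throw(ArgumentError("no such directory: $path"))
    return path
end

existing = mktempdir()
missing_path = joinpath(existing, "zarrpath")  # stands in for a misspelled path
```

It would be used as `zopen(require_existing_store("zarrpath"))`, which fails before the store constructor runs.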

Support for async by default for HTTP/S3/GCS storage?

I have been benchmarking zarr access times in both Python and Julia and noticed that Zarr.jl was an order of magnitude slower. I read through the discussion in #65 and tried some of the tips mentioned there, but was still unable to fully bridge the gap between the two. A key reason is that I am using zarr-python by itself to load the store in my benchmarks rather than xarray/dask; the latter adds a lot of overhead in and of itself with recent versions of dask. Here are my current benchmark results to illustrate.

Zarr.jl

using Zarr

Zarr.Blosc.set_num_threads(8) # To match blosc defaults for python
g = zopen("https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1", consolidated=true)
s = g["analysed_sst"]
@time s[1:3600, 1:1799, 1:500]
# 27.208033 seconds (1.07 M allocations: 12.341 GiB, 4.61% gc time, 0.92% compilation time)
# Note I did this multiple times to eliminate compilation overhead. The machine I am testing on has around 3 Gbps of bandwidth.

zarr-python

import zarr

g = zarr.open_consolidated("https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1")
s = g.analysed_sst
%time s[:500, :1799, :3600] # 100 chunks
# CPU times: user 3.25 s, sys: 5.1 s, total: 8.34 s
# Wall time: 3.6 s

When attempting to manually do async on each chunk I get:

function async_read(xin)
    xout = similar(xin)
    @sync for i in Zarr.DiskArrays.eachchunk(xin)
       @async xout[i...] = xin[i...]
    end
    xout
end

v = view(s, 1:3600, 1:1799, 1:500)
@time async_read(v)
20.875799 seconds (94.77 k allocations: 24.339 GiB, 4.41% gc time)
3600×1799×500 Array{Int16, 3}:

This is barely any improvement, and memory usage has doubled. I suspect the actual reason for the difference is that zarr-python has async built in through fsspec, while the Zarr.jl storage backends do not. To bridge the gap, it may be necessary to read the chunks asynchronously at the library level for the HTTP/GCS/S3 storage backends. Doing it the way I have probably creates a lot of overhead, because each getindex call allocates a small array for its chunk instead of the whole result being allocated once.
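
The idea of fetching chunks concurrently into a single preallocated output can be sketched with `asyncmap` from Base. `fetch_chunk` below is a hypothetical stand-in for one HTTP request (it just sleeps and returns data), and the one-dimensional layout is a simplification. This is not how Zarr.jl implements its stores.

```julia
# Hypothetical stand-in for one HTTP chunk request: sleeps to mimic latency,
# then returns the chunk's values.
function fetch_chunk(r::UnitRange{Int})
    sleep(0.05)
    return fill(Int16(first(r)), length(r))
end

# Fetch all chunks concurrently and copy each into one preallocated output.
function read_chunks_async(ranges; ntasks = 8)
    out = Vector{Int16}(undef, sum(length, ranges))
    asyncmap(ranges; ntasks = ntasks) do r
        out[r] = fetch_chunk(r)
    end
    return out
end

ranges = [i:(i + 9) for i in 1:10:91]   # ten chunks of ten elements each
out = read_chunks_async(ranges)
```

Because the tasks mostly wait on I/O, several requests overlap even on a single thread, and only one output array of the final size is allocated.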

Pinging @meggart and @alex-s-gardner in case you can suggest better methods than what I have tried so far.

Dependency management

To everyone interested in this package: I am currently working on implementing a Zarr backend for LMDB, which would introduce another dependency. There are many more possible backends and compressors, and I was wondering how to decide whether to add one to the Zarr base package or to make an extra package that implements the backend in question. Is there already a need for something like a pure Zarr package that has only a few dependencies and where backends and compressors can be loaded on demand? Or shall we just continue adding dependencies as new backends are implemented? Is it an option to wait until Julia gets a better optional dependency system than what Requires.jl currently offers? Any opinions and suggestions are very welcome.

Get tests running using PyCall zarr dependency

The next step for me would be to create unit tests for reading and writing data to and from Directory stores. For this I would like to add Conda and PyCall as test dependencies to test against the latest version of zarr, i.e. we write a dataset with different chunkings, encodings, etc. and test whether the python zarr implementation can read it and whether the values match, and the other way around.

We probably want PyCall and Conda only as test-specific dependencies, and this is going to be simple to specify from Julia 1.2 https://julialang.github.io/Pkg.jl/dev/creating-packages/#Test-specific-dependencies-1 , so @visr would you agree to set the lower bound of the package compatibility to Julia 1.2? I think it should be ok, since we are at a very early stage anyway.

adding and editing attributes to an existing zarr array

I am wondering if it is possible to add or edit attributes of a zarr array after it has been created.

Here is a naive attempt to do so:

using Zarr 
fname = tempname()
z1 = zcreate(Int, 1000,1000,path = fname,chunks=(100, 100),attrs = Dict(
    "bar" => "foo",
))
z2 = zopen(fname,"w")
z2.attrs["bar"]  = 42
# need a zclose?

z2 = zopen(fname,"r")
z2.attrs["bar"]
# still "foo"

Maybe we would need a zclose that synchronizes z2.attrs with the json file .zattrs.

I am using Zarr v0.9.1.

Performance

I've been playing around with Zarr a bit, and there seem to be some very large performance differences between a native array and a ZArray. Am I missing something here? For example:

using Zarr

function test()
    d = (10,10,10)
    c = (100,100,100)

    z2 = zeros(Float32, d)
    @show typeof(z2), size(z2)
    @time for i in CartesianIndices(z2)
        z2[i] = i[1]+i[2]+i[3]
    end
    @time for i in CartesianIndices(z2)
        z2[i] = i[1]+i[2]+i[3]
    end

    z1 = zzeros(Float32, d...,
                compressor=Zarr.NoCompressor(),
                chunks=c)
    @show z1
    @time for i in CartesianIndices(z1)
        z1[i] = i[1]+i[2]+i[3]
    end
    @time for i in CartesianIndices(z1)
        z1[i] = i[1]+i[2]+i[3]
    end

end

test()

(typeof(z2), size(z2)) = (Array{Float32, 3}, (10, 10, 10))
0.000005 seconds
0.000005 seconds
z1 = ZArray{Float32} of size 10 x 10 x 10
9.879543 seconds (67.00 k allocations: 7.453 GiB, 0.35% gc time)
9.857349 seconds (67.00 k allocations: 7.453 GiB, 0.13% gc time)
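
The allocation figure is consistent with every single-element write touching a whole chunk. The back-of-the-envelope sketch below assumes that each write allocates two chunk-sized buffers, for example one to read and modify the chunk and one to encode it again. That factor of two is an assumption chosen to explain the numbers, not something confirmed from the Zarr.jl source.

```julia
d = (10, 10, 10)       # array size from the example
c = (100, 100, 100)    # chunk size from the example

# Bytes in one full chunk of Float32 values.
chunk_bytes = prod(c) * sizeof(Float32)

# The loop performs one setindex! call per element.
n_writes = prod(d)

# Assumed: two chunk-sized buffers are allocated per write.
estimated_gib = 2 * n_writes * chunk_bytes / 2^30
```

The estimate comes out at about 7.45 GiB, close to the 7.453 GiB reported by `@time`. If that reading is right, choosing chunks no larger than the array, or filling an in-memory array and writing it in one block assignment, should avoid most of the cost.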

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

help with CMIP6 / Google Cloud binder

I am trying to prepare a binder to showcase reading CMIP6 data with Zarr.jl from Google Cloud. Here is the repo:

https://github.com/pangeo-data/pangeo-julia-examples/

I'm trying to just reproduce what is shown here: https://meggart.github.io/Zarr.jl/latest/s3examples/#Accessing-CMIP6-data-on-GCS-1

Here is the binder link:
https://mybinder.org/v2/gh/pangeo-data/pangeo-julia-examples/master?filepath=CMIP6-GCS-Julia.ipynb

However, my kernel just hangs every time it comes to actually read data from GCS. I end up having to restart the kernel. So nothing really works.

My Julia skills are very limited, but this is very important to me. I really have no idea how to debug. I'd appreciate any advice you have about how to get this binder working. PR welcome!

Unable to read data of type Char

dc = Zarr.zopen("http://its-live-data.s3.amazonaws.com/test_datacubes/Malaspina_succeeded_ITS_LIVE_vel_EPSG3413_G0120_X-3250000_Y250000.zarr")

Numerical data can be read in just fine, e.g.:

dc["mid_date"][:]

Character data cannot be read in:

dc["sensor_img1"][:]

I do not believe this was an issue with earlier versions of Zarr.jl.

And here's the stack trace:

ERROR: MethodError: no method matching zero(::Type{Char})
Closest candidates are:
  zero(::Union{Type{P}, P}) where P<:Dates.Period at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Dates/src/periods.jl:53
  zero(::Union{Type{<:Zarr.DateTime64}, Zarr.DateTime64}) at ~/.julia/packages/Zarr/ol3aJ/src/metadata.jl:47
  zero(::Union{Zarr.ASCIIChar, Type{Zarr.ASCIIChar}}) at ~/.julia/packages/Zarr/ol3aJ/src/metadata.jl:22
  ...
Stacktrace:
 [1] _zero(T::Type)
   @ Zarr ~/.julia/packages/Zarr/ol3aJ/src/ZArray.jl:131
 [2] getchunkarray(z::ZArray{Char, 1, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}})
   @ Zarr ~/.julia/packages/Zarr/ol3aJ/src/ZArray.jl:135
 [3] readblock!(aout::Vector{Char}, z::ZArray{Char, 1, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, r::CartesianIndices{1, Tuple{Base.OneTo{Int64}}}; readmode::Bool)
   @ Zarr ~/.julia/packages/Zarr/ol3aJ/src/ZArray.jl:149
 [4] readblock!(aout::Vector{Char}, z::ZArray{Char, 1, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, r::CartesianIndices{1, Tuple{Base.OneTo{Int64}}})
   @ Zarr ~/.julia/packages/Zarr/ol3aJ/src/ZArray.jl:143
 [5] readblock!(a::ZArray{Char, 1, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, aout::Vector{Char}, i::Base.OneTo{Int64})
   @ Zarr ~/.julia/packages/Zarr/ol3aJ/src/ZArray.jl:174
 [6] getindex_disk(a::ZArray{Char, 1, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, i::Function)
   @ DiskArrays ~/.julia/packages/DiskArrays/VGEpq/src/diskarray.jl:31
 [7] getindex(a::ZArray{Char, 1, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, i::Function)
   @ DiskArrays ~/.julia/packages/DiskArrays/VGEpq/src/diskarray.jl:177
 [8] top-level scope
   @ REPL[127]:1
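
The trace shows the read failing because no `zero(::Type{Char})` method is found when the chunk buffer is initialised. One way such a gap could be closed is a fallback that special-cases `Char`. `default_fill` below is a hypothetical helper for illustration, not the function Zarr.jl uses.

```julia
# Hypothetical helper: a default initial value for a chunk buffer.
# The generic method relies on zero(T); Char gets its own method because
# the trace above shows that zero(Char) is not available.
default_fill(::Type{T}) where {T} = zero(T)
default_fill(::Type{Char}) = Char(0)

# Allocate and initialise a small buffer for each element type.
char_buffer = fill(default_fill(Char), 4)
float_buffer = fill(default_fill(Float32), 4)
```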

Package improvements

This issue is to collect todo's for this package and to distribute work, if anyone stumbles over this and wants to help:

  • write documentation, a good place to start would be to translate this tutorial https://zarr.readthedocs.io/en/stable/tutorial.html into a proper doctest
    • parts of the tutorial have been translated to a doctest
    • initial docs are in place but need to be extended
  • add support for appending and resizing arrays
  • add string support
    • simple ASCIIStrings are supported, but no more complicated strings
  • add support for more advanced indexing, like bitmask indexing, or what they call orthogonal indexing. We probably cannot rely on AbstractArray fallback methods, because this would be a performance nightmare
  • for the future, customize broadcast to work efficiently with these arrays. Maybe this can be done through other packages like Dagger, but one would have to find out how to link these
  • think about thread-safety, maybe by applying file-locks https://github.com/vtjnash/Pidfile.jl
  • implement some cloud backends (S3, Google Cloud Store...)

License

Just noticed there is no LICENSE file yet. Just MIT I guess?

`fill_value` should provide default value

From Zarr spec,

fill_value
A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.

I think that fill_value = null would correspond to missing in Julia.

Python Zarr also has the following example:

>>> import zarr
>>> z = zarr.full((10000, 10000), chunks=(1000, 1000), fill_value=42)
>>> z
<zarr.core.Array (10000, 10000) float64>
>>> z[:2, :2]
array([[42., 42.],
       [42., 42.]])

Modifying slightly, we can save to disk with z1 = zarr.open_array('/tmp/example.zarr', mode='w', shape=(10,10),fill_value=42). When we open with Zarr.jl,

julia> z1 = zopen("/tmp/example.zarr")
ZArray{Union{Missing, Float64}} of size 10 x 10

julia> z1[1:2,1:2]
2×2 reshape(::Matrix{Union{Missing, Float64}}, 2, 2) with eltype Union{Missing, Float64}:
 missing  missing
 missing  missing

Per spec & python compatibility, this should be [42 42; 42 42]. I'm not sure how hard this would be to change, but I'm happy to look into it if a PR would be welcome? I think this would also resolve #39
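
The behaviour the spec describes can be sketched independently of Zarr.jl: a chunk that was never written is materialised from `fill_value`, while a stored chunk is returned as is. The `Dict` below is a hypothetical stand-in for a store, and `read_chunk` is an illustrative name.

```julia
# Hypothetical stand-in for a store: maps a chunk index to its decoded data.
# A missing key means the chunk was never written.
stored = Dict(1 => [1.0, 2.0, 3.0])

# Return the stored chunk, or a chunk filled with fill_value if it is absent.
function read_chunk(stored::Dict{Int,Vector{Float64}}, i::Int, chunklen::Int, fill_value::Float64)
    return get(() -> fill(fill_value, chunklen), stored, i)
end

written = read_chunk(stored, 1, 3, 42.0)     # chunk 1 exists in the store
unwritten = read_chunk(stored, 2, 3, 42.0)   # chunk 2 does not
```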

If metadata.fill_value is nothing

If the fill_value for a ZArray is decoded as nothing and the user attempts to access the array at out of bounds indices, an error occurs during filling. I understand from the Zarr specification that accessing uninitialized (not existing) chunks should return the fill value, in this case nothing. The concrete issue here is the pre-allocation of a fixed type for the array. Maybe the arrays should be pre-allocated as Array{Union{T, Nothing}}?

Change the fill_value from within Julia

Is there a way to set the fill_value from within Julia?

I can change the fill_value by manually setting it in the .zarray file inside my data folder, but it would be nicer to set it from within Julia.

If I try to naively set it in the metadata type of my ZArray, I get the following error:

julia> zc2.metadata.fill_value = -99
ERROR: setfield! immutable struct of type Metadata cannot be changed
Stacktrace:
 [1] setproperty!(::Zarr.Metadata{Union{Missing, Float32},3,Zarr.BloscCompressor}, ::Symbol, ::Int64) at ./Base.jl:34
 [2] top-level scope at REPL[55]:1
 [3] eval(::Module, ::Any) at ./boot.jl:331
 [4] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [5] run_backend(::REPL.REPLBackend) at /home/crem_fe/.julia/packages/Revise/XFtoQ/src/Revise.jl:1162
 [6] top-level scope at REPL[1]:0
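
The error arises because fields of an immutable struct cannot be reassigned. The usual pattern in Julia is to build a new value with the one field replaced. The sketch below uses a hypothetical, simplified `Meta` struct rather than the real `Zarr.Metadata`, and says nothing about how the new metadata would be written back to the `.zarray` file.

```julia
# Hypothetical, simplified stand-in for Zarr.Metadata.
struct Meta{T}
    shape::NTuple{3,Int}
    fill_value::T
end

# Build a new value with one field replaced instead of mutating in place.
with_fill_value(m::Meta, fv) = Meta(m.shape, fv)

m = Meta((4, 4, 4), missing)
m2 = with_fill_value(m, -99.0f0)
```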

Improve S3Store

Two things need to be implemented: writing data to an S3 store, and some way to parallelize downloads when many chunks are requested.

Finding an org for Zarr.jl

In the discussion at JuliaClimate JuliaClimate/meta#1 it was suggested to move Zarr.jl to the JuliaClimate org, and there were a few thumbs-ups for this. Do we see this as a consensus to move the package there, or is additional discussion necessary? @visr @Balinus

how should `"fill_value": null` be read?

We have a zarray with the following .zarray:

{"dtype":"<f4","filters":null,"shape":[15706,720,1440],"order":"C","zarr_format":2,"chunks":[5844,60,60],"fill_value":null,"compressor":{"blocksize":0,"id":"blosc","clevel":5,"cname":"lz4","shuffle":1}}

When reading data from the directory, I get the following error:

julia> data = zopen(path)
julia> data[1, 1, 1]
ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type Float32
Closest candidates are:
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:273
  convert(::Type{T}, ::AbstractChar) where T<:Number at char.jl:185
  convert(::Type{T}, ::CartesianIndex{1}) where T<:Number at multidimensional.jl:130
  ...
Stacktrace:
  [1] fill!(dest::Array{Float32, 3}, x::Nothing)
    @ Base ./array.jl:351
  [2] readchunk!(a::Array{Float32, 3}, z::ZArray{Float32, 3, Zarr.BloscCompressor, DirectoryStore}, i::CartesianIndex{3})
    @ Zarr ~/.julia/packages/Zarr/tmr2s/src/ZArray.jl:188
  [3] (::Zarr.var"#66#69"{Bool, OffsetArrays.OffsetArray{Float32, 3, Array{Float32, 3}}, ZArray{Float32, 3, Zarr.BloscCompressor, DirectoryStore}, CartesianIndices{3, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}}, Array{Float32, 3}})(bI::CartesianIndex{3})
    @ Zarr ~/.julia/packages/Zarr/tmr2s/src/ZArray.jl:161
  [4] foreach
    @ ./abstractarray.jl:2774 [inlined]
  [5] readblock!(aout::Array{Float32, 3}, z::ZArray{Float32, 3, Zarr.BloscCompressor, DirectoryStore}, r::CartesianIndices{3, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}}; readmode::Bool)
    @ Zarr ~/.julia/packages/Zarr/tmr2s/src/ZArray.jl:151
  [6] readblock!
    @ ~/.julia/packages/Zarr/tmr2s/src/ZArray.jl:143 [inlined]
  [7] readblock!
    @ ~/.julia/packages/Zarr/tmr2s/src/ZArray.jl:174 [inlined]
  [8] getindex_disk
    @ ~/.julia/packages/DiskArrays/VGEpq/src/diskarray.jl:31 [inlined]
  [9] getindex(::ZArray{Float32, 3, Zarr.BloscCompressor, DirectoryStore}, ::Int64, ::Int64, ::Int64)
    @ DiskArrays ~/.julia/packages/DiskArrays/VGEpq/src/diskarray.jl:177
 [10] top-level scope
    @ REPL[30]:1

I think the error comes from fill_value:null. The data can be read with xarray. Is this error correct behavior?
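
The trace shows `fill!` being called with `nothing` on an `Array{Float32, 3}`. One way to make a null fill value representable, sketched below under the assumption that null should map to `missing`, is to widen the element type when no fill value is given. Both helper names are hypothetical and not part of Zarr.jl.

```julia
# Hypothetical helpers, not part of Zarr.jl.
# A null fill value (decoded as nothing) widens the element type to allow missing.
fill_eltype(::Type{T}, fill_value) where {T} = fill_value === nothing ? Union{Missing,T} : T

# Allocate a chunk for uninitialized data and fill it accordingly.
function empty_chunk(::Type{T}, fill_value, dims...) where {T}
    out = Array{fill_eltype(T, fill_value)}(undef, dims...)
    return fill!(out, fill_value === nothing ? missing : T(fill_value))
end

no_fill = empty_chunk(Float32, nothing, 2, 2)   # all elements are missing
zero_fill = empty_chunk(Float32, 0, 2, 2)       # all elements are 0.0f0
```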

@MartinuzziFrancesco
