
DataToolkitCommon.jl's Issues

Specify storage location for downloaded datasets

Not sure if this is the right repo for this, feel free to move it!

Occasionally it would be useful to set the location of downloaded datasets; e.g. in an HPC environment I want to download to a fast scratch space rather than the home folder that typically contains my projects. But I wouldn't want to hard-code that location into the Data.toml file, since that wouldn't be portable.

Right now, I typically do something like

path = @load_preference("my_location", get(ENV, "MY_LOCATION", "./my_location"))

I think I can see how one could do this by executing arbitrary Julia code in one of the plugins, but it seems like a common enough ask that it might be worth building in.

Add HDF5 loader

Some complications that this loader might introduce:

  • a single .h5 file frequently corresponds to several datasets, and it would be nice if the specification in Data.toml could refer directly to a single DataSet while preserving the hierarchical relationship between multiple DataSets.
    • More generally, I frequently work with a directory full of .h5 files, and it would be nice to preserve the relationship between all of these files as well.
  • In HDF5, a dataset can have attributes, and it is important that these be accessible from DataToolkit (see the sketch after this list).
  • The HDF5.jl package is relatively heavyweight, so it would be good not to take it on as an explicit dependency.
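
For context, a minimal sketch of the access pattern such a loader would need to cover, using HDF5.jl (the file, dataset, and attribute names here are hypothetical):

using HDF5

h5open("data.h5", "r") do file
    dset = file["group/dataset"]      # hypothetical path within the file
    values = read(dset)               # the dataset's contents
    meta = attributes(dset)           # its attributes, accessible by name
    units = read(meta["units"])       # assuming an attribute named "units" exists
end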

Multi-file loader

As recently discussed on Zulip, it would be nice to have a loader that allows loading multiple files sharing the same schema, something already supported by e.g. CSV.jl or Arrow.jl. So I thought I'd make an issue to track this :)
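
For reference, a sketch of the multi-file reading CSV.jl already offers, which such a loader could presumably wrap (the file names are hypothetical):

using CSV, DataFrames

# CSV.jl accepts a vector of files with a shared schema and concatenates the rows.
files = ["part-1.csv", "part-2.csv", "part-3.csv"]
df = CSV.read(files, DataFrame)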

MD5 fails to get checksum with TypeError

Hi,

With the following Data.toml file

data_config_version = 0
uuid = "b5a98bdc-8e4d-450d-bbf7-0f38255133d1"
name = "data"
plugins = ["store", "defaults", "memorise"]

[config.defaults.storage._]
checksum = "md5"


[[test]]
uuid = "3413c2c9-738c-4b92-a089-db554a96531b"

    [[test.storage]]
    driver = "web"
    url = "https://www.random.org/integers/?num=10&min=1&max=6&col=1&base=10&format=plain&rnd=new"

    [[test.loader]]
    driver = "passthrough"

I've got the following error

ERROR: TypeError: in typeassert, expected Vector{UInt8}, got a value of type Base.ReinterpretArray{UInt8, 1, UInt32, Vector{UInt32}, false}
Stacktrace:
  [1] getchecksum(file::String, method::Symbol)
    @ DataToolkitCommon.Store ~/.julia/packages/DataToolkitCommon/SuyzS/src/store/storage.jl:195
  [2] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Base ./essentials.jl:819
  [3] invokelatest(::Any, ::Any, ::Vararg{Any})
    @ Base ./essentials.jl:816
  [4] invokepkglatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/usepkg.jl:83
  [5] invokepkglatest(::Any, ::Any, ::Vararg{Any})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/usepkg.jl:82
  [6] invokepkglatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/usepkg.jl:85
  [7] invokepkglatest(::Any, ::Any, ::Vararg{Any})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/usepkg.jl:82
  [8] invokepkglatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/usepkg.jl:85
  [9] invokepkglatest
    @ ~/.julia/packages/DataToolkitBase/JUjFu/src/model/usepkg.jl:82 [inlined]
 [10] getchecksum(storage::DataToolkitBase.DataStorage, file::String)
    @ DataToolkitCommon.Store ~/.julia/packages/DataToolkitCommon/SuyzS/src/store/storage.jl:228
 [11] storesave(inventory::DataToolkitCommon.Store.Inventory, storage::DataToolkitBase.DataStorage, #unused#::Type{DataToolkitBase.FilePath}, file::DataToolkitBase.FilePath)
    @ DataToolkitCommon.Store ~/.julia/packages/DataToolkitCommon/SuyzS/src/store/storage.jl:287
 [12] (::DataToolkitCommon.Store.var"#55#61")(f::typeof(DataToolkitBase.storage), storer::DataToolkitBase.DataStorage, as::Type; write::Bool)
    @ DataToolkitCommon.Store ~/.julia/packages/DataToolkitCommon/SuyzS/src/store/plugins.jl:184
 [13] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:write,), Tuple{Bool}}})
    @ Base ./essentials.jl:821
 [14] invokepkglatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:write,), Tuple{Bool}}})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/usepkg.jl:83
 [15] (::DataToolkitBase.Advice{typeof(DataToolkitBase.storage), Tuple{DataToolkitBase.DataStorage, Type}})(callform::Tuple{Function, Function, Tuple, NamedTuple})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/advice.jl:18
 [16] call_composed (repeats 8 times)
    @ ./operators.jl:1034 [inlined]
 [17] #_#97
    @ ./operators.jl:1031 [inlined]
 [18] (::ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{DataToolkitBase.Advice{typeof(DataToolkitBase.tospec), Tuple{DataToolkitBase.AbstractDataTransformer}}, DataToolkitBase.Advice{typeof(DataToolkitBase.tospec), Tuple{DataToolkitBase.DataSet}}}, DataToolkitBase.Advice{typeof(DataToolkitBase.fromspec), Tuple{Type{<:DataToolkitBase.AbstractDataTransformer}, DataToolkitBase.DataSet, Dict{String, Any}}}}, DataToolkitBase.Advice{typeof(DataToolkitBase.fromspec), Tuple{Type{DataToolkitBase.DataSet}, DataToolkitBase.DataCollection, String, Dict{String, Any}}}}, DataToolkitBase.Advice{typeof(DataToolkitCommon.show_extra), Tuple{IO, DataToolkitBase.DataSet}}}, DataToolkitBase.Advice{typeof(DataToolkitBase.init), Tuple{DataToolkitBase.DataCollection}}}, DataToolkitBase.Advice{typeof(DataToolkitCommon.Store.rhash), Tuple{DataToolkitBase.DataStorage, DataToolkitBase.SmallDict, UInt64}}}, DataToolkitBase.Advice{typeof(DataToolkitBase.storage), Tuple{DataToolkitBase.DataStorage, Type}}}, DataToolkitBase.Advice{typeof(DataToolkitBase._read), Tuple{DataToolkitBase.DataSet, Type}}})(x::Tuple{typeof(identity), typeof(DataToolkitBase.storage), Tuple{DataToolkitBase.DataStorage{:web, DataToolkitBase.DataSet}, DataType}, NamedTuple{(:write,), Tuple{Bool}}})
    @ Base ./operators.jl:1031
 [19] (::DataToolkitBase.AdviceAmalgamation)(annotated_func_call::Tuple{Function, Function, Tuple, NamedTuple})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/advice.jl:95
 [20] (::DataToolkitBase.AdviceAmalgamation)(::Function, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, V, Tuple{Vararg{Symbol, N}}, NamedTuple{names, T}} where {V, N, names, T<:Tuple{Vararg{Any, N}}})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/advice.jl:100
 [21] open(data::DataToolkitBase.DataSet, as::Type; write::Bool)
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/interaction/externals.jl:311
 [22] open
    @ ~/.julia/packages/DataToolkitBase/JUjFu/src/interaction/externals.jl:308 [inlined]
 [23] _read(dataset::DataToolkitBase.DataSet, as::Type)
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/interaction/externals.jl:218
 [24] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Base ./essentials.jl:819
 [25] invokelatest(::Any, ::Any, ::Vararg{Any})
    @ Base ./essentials.jl:816
 [26] invokepkglatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/usepkg.jl:83
 [27] invokepkglatest(::Any, ::Any, ::Vararg{Any})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/usepkg.jl:82
 [28] (::DataToolkitBase.AdviceAmalgamation)(::Function, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, V, Tuple{Vararg{Symbol, N}}, NamedTuple{names, T}} where {V, N, names, T<:Tuple{Vararg{Any, N}}})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/advice.jl:102
 [29] (::DataToolkitBase.AdviceAmalgamation)(::Function, ::Any, ::Vararg{Any})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/advice.jl:98
 [30] macro expansion
    @ ~/.julia/packages/DataToolkitBase/JUjFu/src/model/advice.jl:131 [inlined]
 [31] _dataadvisecall(::typeof(DataToolkitBase._read), ::DataToolkitBase.DataSet, ::Type{IO}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/advice.jl:131
 [32] _dataadvisecall(::Function, ::DataToolkitBase.DataSet, ::Vararg{Any})
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/model/advice.jl:131
 [33] read(dataset::DataToolkitBase.DataSet)
    @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/interaction/externals.jl:160
 [34] top-level scope
    @ ~/.julia/packages/DataToolkit/QoiqL/src/DataToolkit.jl:48

Configuration: Julia 1.9.4 with

[dc83c90b] DataToolkit v0.8.0
[9e6fccbf] DataToolkitCommon v0.8.0
[6ac74813] MD5 v0.2.2

Thanks!
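
For context, a minimal sketch of what the typeassert appears to trip over, assuming (as the stacktrace suggests) that MD5.jl returns its digest as a reinterpreted view of UInt32 state rather than a plain Vector{UInt8}:

using MD5

digest = md5("some content")
digest isa Vector{UInt8}           # expected to be false here: a Base.ReinterpretArray per the stacktrace
collect(digest) isa Vector{UInt8}  # true: collecting yields a plain byte vector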

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Feature request: Arrow loader

A loader for Arrow files would be great. I tried to look up how to add a new loader, but it was a bit difficult for me to navigate the codebase. If that's something users could contribute, maybe a tutorial would be helpful.

Lazy loading for Arrow.jl

In the wake of #16 you activated lazy loading for ArchGDAL, so that .gpkg data can be loaded even when ArchGDAL has not been explicitly imported by the user beforehand. The same issue appears for Arrow files: currently you get ERROR: TransformerError: Data set ... could not be loaded in any form. unless you load the Arrow or DataFrames package first.
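
Until lazy loading is wired up for Arrow, a sketch of the workaround (the dataset name is hypothetical): load the packages explicitly before reading.

using DataToolkit, Arrow, DataFrames

df = read(d"mydata", DataFrame)   # "mydata" stands in for the actual data set name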

add REPL command mismatch behaviour and documentation

The help for the add REPL command states the following syntax, if I understood correctly:

add <name> via <drivers>… from <source>

I ran this command: add postit via -sl passthrough from data/exp_raw/2022/2022-10-18/postit,
but the word via was then also treated as a possible Storage or Loader, resulting in the following output:

! Failed to create 'via' Storage
 ! Failed to create 'passthrough' Storage
 ! Failed to create 'via' Loader
 ! Failed to create 'passthrough' Loader

Test all the data transformers

Testing is good. With https://github.com/tecosaur/DataToolkitTestAssets it should be easier to test the transformers.
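
For illustration, a transformer test might look roughly like this (a sketch only; the dataset name and the fixture it implies are assumptions, not existing test assets):

using Test, DataToolkit

@testset "web storage (read)" begin
    d = dataset("web-test")        # hypothetical dataset defined in a test Data.toml
    @test read(d, IO) isa IO       # the web driver should hand back an IO stream
end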

Storage

  • Filesystem
    • read
    • write
  • Git
    • read
    • write
  • Null
    • read
    • write
  • Passthrough
    • read
    • write
  • Web
    • read
    • write

Loaders/Writers

  • Arrow (read)
  • Arrow (write)
  • Chain (read)
  • Compression, Gzip (read)
  • Compression, Gzip (write)
  • Compression, Zlib (read)
  • Compression, Zlib (write)
  • Compression, Deflate (read)
  • Compression, Deflate (write)
  • Compression, Bzip (read)
  • Compression, Bzip (write)
  • Compression, Xz (read)
  • Compression, Xz (write)
  • Compression, Zstd (read)
  • Compression, Zstd (write)
  • CSV (read)
  • CSV (write)
  • Delim (read)
  • Delim (write)
  • io->file (read)
  • jld2 (read)
  • jld2 (write)
  • jpeg (read)
  • json (read)
  • json (write)
  • julia (read)
  • julia (write)
  • netpbm (read)
  • passthrough (read)
  • passthrough (write)
  • png (read)
  • qoi (read)
  • sqlite (read)
  • sqlite (write)
  • tar (read)
  • tiff (read)
  • xlsx (read)
  • zip (read)

Tutorial for adding a new data loader

As discussed in #10, here is a short docs writeup of the process of creating the Arrow loader as an example of how to add a loader to the package. Let me know what you think and of course feel free to adapt, extend or rephrase! (I would have made a PR but don't understand how the docs work).

Tutorial: Adding a new loader/writer

In case your favourite data format is not yet supported by DataToolkit, fret not! It is relatively straightforward to add a new loader to the package, and PRs adding new loaders are welcome. The following briefly outlines the process, based on the loader/writer for the Arrow format.

Step 1: Adding the main loader / writer functions

Each loader has its own file in the src/transformers/saveload directory. So as a first step, we add a new file arrow.jl there and make sure to include("transformers/saveload/arrow.jl") in src/DataToolkitCommon.jl, next to the other loader files.

We're now ready to add methods to the main package functions responsible for loading and saving data, which are aptly called load and save. Starting with the loader, we add a method to load which dispatches on DataLoader{:arrow}, takes an IO and allows the specification of a sink type to read data into, e.g. a DataFrame. The final function looks as follows:

function load(loader::DataLoader{:arrow}, io::IO, sink::Type)
    @import Arrow
    convert = @getparam loader."convert"::Bool true
    result = Arrow.Table(io; convert) |>
    if sink == Any || sink == Arrow.Table
        identity
    elseif QualifiedType(sink) == QualifiedType(:DataFrames, :DataFrame)
        sink
    end
    result
end

This function does four things:

  1. An @import statement for the Arrow package which we use for reading a .arrow file
  2. Use of the @getparam macro to obtain arguments to the wrapped loader function (Arrow.Table, in our case) from the Data.toml file and to set their defaults. Here, we just need to specify the single convert argument, but in principle, there can be many.
  3. Reading the data from io, most likely using a package and including the arguments obtained in step 2 (here: Arrow.Table(io; convert)).
  4. Conversion to the specified sink type. Note the use of QualifiedType, which needs to be specified separately.

The sink types supported by the loader (and resolved in step 4) are declared by adding a method to the supportedtypes function. Here, we specify two possible return types: Arrow.Table, which is what the Arrow.jl package returns natively, and DataFrame from the DataFrames.jl package:

supportedtypes(::Type{DataLoader{:arrow}}) =
    [QualifiedType(:DataFrames, :DataFrame),
     QualifiedType(:Arrow, :Table)]

The writer follows a similar overall structure: @import the necessary packages, obtain the writer arguments using @getparam, and then write the data in tbl to io. Here's the save method for the Arrow writer:

function save(writer::DataWriter{:arrow}, io::IO, tbl)
    @import Arrow
    compress         = @getparam writer."compress"::Union{Symbol, Nothing} nothing
    alignment        = @getparam writer."alignment"::Int 8
    dictencode       = @getparam writer."dictencode"::Bool false
    dictencodenested = @getparam writer."dictencodenested"::Bool false
    denseunions      = @getparam writer."denseunions"::Bool true
    largelists       = @getparam writer."largelists"::Bool false
    maxdepth         = @getparam writer."maxdepth"::Int 6
    ntasks           = @getparam writer."ntasks"::Int Int(typemax(Int32))
    Arrow.write(
        io, tbl;
        compress, alignment,
        dictencode, dictencodenested,
        denseunions, largelists,
        maxdepth, ntasks)
end

We also need to add a method to the create function for our loader, with a regex that recognises files of our data format:

create(::Type{DataLoader{:arrow}}, source::String) =
    !isnothing(match(r"\.arrow$"i, source))

...and a method to createpriority specifying... TODO: what exactly?

createpriority(::Type{DataLoader{:arrow}}) = 10

Finally, we add a docstring specifying how to use our loader/writer:

const ARROW_DOC = md"""
[...]
"""

That's the full content of the new arrow.jl file!

Step 2: Adding the new loader to package initialization

To make things work, we now just need to add two more things to the __init__() function in src/DataToolkitCommon.jl:

  1. A line declaring the packages used by our loader, together with their UUIDs (which you can obtain from each package's Project.toml). In our case that is @addpkg Arrow "69666777-d1a9-59fb-9406-91d4454c9d45".
  2. A line adding our docstring to the package documentation. In our case, we just add (:loader, :arrow) => ARROW_DOC, to the list of docstrings in the append!(DataToolkitBase.TRANSFORMER_DOCUMENTATION, ...) call further below.

Both additions are sketched after this list.
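
Putting those two items together, the Arrow-related additions to __init__() might look roughly like this (a sketch only; the actual body of __init__ in the package contains more and may be organised differently):

function __init__()
    # 1. Register the Arrow package so `@import Arrow` can load it on demand.
    @addpkg Arrow "69666777-d1a9-59fb-9406-91d4454c9d45"
    # ... other @addpkg lines ...

    # 2. Register the loader/writer documentation.
    append!(DataToolkitBase.TRANSFORMER_DOCUMENTATION,
            [(:loader, :arrow) => ARROW_DOC])
    # ... other docstring entries ...
end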
