avro.jl's People

Contributors

andyferris, anhi, quinnj, twadleigh

avro.jl's Issues

`DateTime` encoding defaults to microseconds?

Hi @quinnj,

I've just started using this great package - it seems perfect for our use! Thanks.

I was curious why the default schematype for `DateTime` is in microseconds. It would seem more natural to export milliseconds by default, given that this is the precision Julia uses for `DateTime`.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Problem writing string columns to avro file

When I try to convert an Arrow file to Avro using this code:

using Arrow, Avro, DataFrames, Tables

function t_arrowtoavro(path, file)
    a = Arrow.Table(joinpath(path, "$(file).arrow")) |> DataFrame
    println(Tables.schema(a))
    Avro.writetable(joinpath(path, "$(file).avro"), a; compress=:zstd)
end

I get the following error whenever the input file contains a string column, e.g.

Tables.Schema:
 :IndividualId  Int32
 :ResultDate    Union{Missing, Date}
 :HIVResult     Union{Missing, String}
ERROR: ArgumentError: internal writing error: buffer too small, len = 1563130

Whereas this version does not produce this error:

function t_arrowtoavro(path, file)
    a = Arrow.Table(joinpath(path, "$(file).arrow")) |> DataFrame
    println(Tables.schema(a))
    b = select(a, :IndividualId, :ResultDate)
    println(Tables.schema(b))
    Avro.writetable(joinpath(path, "$(file).avro"), b; compress=:zstd)
end

Output:

Tables.Schema:
 :IndividualId  Int32
 :ResultDate    Union{Missing, Date}
 :HIVResult     Union{Missing, String}
Tables.Schema:
 :IndividualId  Int32
 :ResultDate    Union{Missing, Date}
"D:\\Data\\Demography\\AHRI\\Staging\\HIVResults.avro"

How do I resolve this problem if I need a string column in my output file?

write fails on large dataframes

hi :)

As indicated in the Tables.jl docs, for large dataframes (>10k columns), Tables.Schema stores its values in the .storednames and .storedtypes fields instead of in the type parameters. This causes e.g. Avro.writetable to crash, because the names/types type parameters are nothing.

signed(UInt8) is not defined until Julia 1.5

We should probably either raise the project's minimum Julia version to 1.5 or else work around the issue that is making all of the CI runs fail.

https://github.com/JuliaData/Avro.jl/blob/main/src/types/binary.jl#L151

   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.3.1 (2019-12-30)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> signed(UInt8)
ERROR: MethodError: Cannot `convert` an object of type Type{UInt8} to an object of type Signed
Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  convert(::Type{T}, ::Ptr) where T<:Integer at pointer.jl:23
  ...
Stacktrace:
 [1] signed(::Type) at ./int.jl:170
 [2] top-level scope at REPL[1]:1
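
One possible workaround for older Julia versions, sketched here as an assumption rather than as what Avro.jl actually does: avoid calling signed on a type and map the unsigned types to their signed counterparts explicitly. The name signedtype is hypothetical and does not exist in Avro.jl or Base.

```julia
# Hypothetical compatibility helper (not part of Avro.jl or Base):
# map an unsigned integer type to the signed type of the same width
# without relying on signed(::Type), which is missing before Julia 1.5.
signedtype(::Type{UInt8}) = Int8
signedtype(::Type{UInt16}) = Int16
signedtype(::Type{UInt32}) = Int32
signedtype(::Type{UInt64}) = Int64
signedtype(::Type{UInt128}) = Int128
# Signed types map to themselves.
signedtype(::Type{T}) where {T<:Signed} = T

println(signedtype(UInt8))
```

Whether this fits depends on how the result is used at the linked line; raising the minimum Julia version may well be the simpler fix.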

Error writing empty nested arrays

I was getting assertion errors while writing some data to disk and have reduced it to this minimal repro:

julia> Avro.write([Bool[]])
ERROR: AssertionError: nb == pos - startpos
Stacktrace:
 [1] writevalue(B::Avro.Binary, A::Avro.ArrayType, x::Vector{Vector{Bool}}, buf::Vector{UInt8}, pos::Int64, len::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Avro ~/.julia/packages/Avro/JEoRa/src/types/arrays.jl:44
 [2] write(obj::Vector{Vector{Bool}}; schema::Avro.ArrayType, jsonencoding::Bool, kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:26
 [3] write(obj::Vector{Vector{Bool}})
   @ Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:22
 [4] top-level scope
   @ REPL[31]:1

Something appears to be going wrong with empty arrays (but only if they are nested inside of other arrays).

Julia 1.6.2, Avro 1.0.0.

Is this package still maintained?

This package hasn't seen any updates in 2 years. Is it still working in 2023?

@quinnj it seems you have been responsible for most of the good stuff that has been done around here; will you please answer?

Error writing maps with exactly 64 entries

It took me a while to figure out this particular problem. Even after #10 I still have trouble writing a small subset of our data, and I was surprised to narrow it down to dictionaries with exactly 64 elements. Here's a minimal example:

julia> Avro.write([Dict("$i" => i for i in 1:64)])
ERROR: AssertionError: nb == pos - startpos
Stacktrace:
 [1] writevalue(B::Avro.Binary, A::Avro.ArrayType, x::Vector{Dict{String, Int64}}, buf::Vector{UInt8}, pos::Int64, len::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Avro ~/.julia/dev/Avro/src/types/arrays.jl:44
 [2] write(obj::Vector{Dict{String, Int64}}; schema::Avro.ArrayType, jsonencoding::Bool, kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Avro ~/.julia/dev/Avro/src/types/binary.jl:26
 [3] write(obj::Vector{Dict{String, Int64}})
   @ Avro ~/.julia/dev/Avro/src/types/binary.jl:22
 [4] top-level scope
   @ REPL[51]:1

It works fine with 63 or 65 entries. As in #9, the number of bytes written does not match nbytes. The value of 64 is suggestive, but I don't see an obvious problem with it:

julia> Avro.nbytes(64)
2

julia> Avro.write(64)
2-element Vector{UInt8}:
 0x80
 0x01

julia> Avro.read([0x80, 0x01], Int)
64

Searching all lengths exhaustively from 0:2^15 I find the same issue at 8192. This is another point where the integer adds a byte, [0x80, 0x80, 0x01]. I get the same error at 2^20, so there is a pattern here. Does this suggest anything to you @quinnj ?
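
For reference, the failing lengths line up with the points where Avro's zigzag varint encoding of a long grows by one byte. The following is a minimal standalone sketch of that encoding as described in the Avro specification; zigzag and varint are illustrative names and not Avro.jl functions, and the sketch says nothing about where the bug in the package actually is.

```julia
# Zigzag-map a signed 64-bit integer to an unsigned one, so that values of
# small magnitude (positive or negative) get small unsigned codes.
zigzag(n::Int64) = reinterpret(UInt64, xor(n << 1, n >> 63))

# Encode as a base-128 varint: 7 payload bits per byte, with the high bit
# set on every byte except the last.
function varint(n::Integer)
    z = zigzag(Int64(n))
    out = UInt8[]
    while z > 0x7f
        push!(out, UInt8(z & 0x7f) | 0x80)
        z >>= 7
    end
    push!(out, UInt8(z))
    return out
end

# Print the encoded length just below and at each reported failure point.
for n in (63, 64, 8191, 8192, 2^20 - 1, 2^20)
    println(n, " => ", length(varint(n)), " byte(s)")
end
```

Under this encoding, 64, 8192 and 2^20 are each the smallest non-negative value that needs 2, 3 and 4 bytes respectively, which matches the lengths reported above.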

Read and write avro one record at a time

I have a use case where the file I would be writing to or reading from is too large to fit in memory. However, I only need to access one record at a time.

Is there currently any way to ensure that only one record is in memory at any time, similar to CSV.Rows?
Is there a way to write to the avro file row by row?

Data is always written without compression

Experimenting with the library, it seems that the data is compressed during writing, but the original, uncompressed data is written out (tables.jl line 106 writes bytes instead of finalbytes).