JuliaData / Avro.jl

Pure Julia implementation for reading/writing data in the Avro format

License: MIT License

Julia 100.00%

avro.jl's Issues

`DateTime` encoding defaults to microseconds?

Hi @quinnj,

I just started using this great package, and it seems perfect for our use. Thanks!

I was curious why the default schema type for `DateTime` is in microseconds. It would seem more natural to export milliseconds by default, given that millisecond is the precision `DateTime` uses in Julia.
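
For reference, a `DateTime` holds an integer count of milliseconds, so the two encodings differ only by a factor of 1000. Below is a minimal sketch using only the `Dates` standard library. `UNIX_EPOCH`, `unix_millis`, and `unix_micros` are hypothetical helper names for illustration, not part of Avro.jl; the mapping to the Avro `timestamp-millis` and `timestamp-micros` logical types is based on the Avro specification.

```julia
using Dates

# Reference point used by Avro's timestamp logical types.
const UNIX_EPOCH = DateTime(1970, 1, 1)

# Milliseconds since the Unix epoch (what timestamp-millis would store).
# Subtracting two DateTimes yields a Millisecond period; Dates.value unwraps it.
unix_millis(dt::DateTime) = Dates.value(dt - UNIX_EPOCH)

# Microseconds since the Unix epoch (what timestamp-micros would store).
# No extra precision is gained; the value is just scaled by 1000.
unix_micros(dt::DateTime) = unix_millis(dt) * 1000

dt = DateTime(2021, 1, 1, 0, 0, 0, 123)
unix_millis(dt)  # 1609459200123
unix_micros(dt)  # 1609459200123000
```

Under these assumptions, writing microseconds loses nothing, but it also carries three trailing zeros that a `DateTime` can never fill.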

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Add test case(s) for reading some pre-written object files

I ran into an issue reading one of my own files. I think I debugged it, but for the sake of avoiding regressions it would be good to test for this going forward.

Maybe generate some examples with pyavro.

Problem writing string columns to avro file

When I try to convert an Arrow file to Avro using this code:

using Arrow, Avro, DataFrames, Tables

function t_arrowtoavro(path, file)
    a = Arrow.Table(joinpath(path, "$(file).arrow")) |> DataFrame
    println(Tables.schema(a))
    Avro.writetable(joinpath(path, "$(file).avro"), a; compress=:zstd)
end

I get the following error whenever the input file contains a string column, e.g.

Tables.Schema:
 :IndividualId  Int32
 :ResultDate    Union{Missing, Date}
 :HIVResult     Union{Missing, String}
ERROR: ArgumentError: internal writing error: buffer too small, len = 1563130

Whereas this version does not produce this error:

function t_arrowtoavro(path, file)
    a = Arrow.Table(joinpath(path, "$(file).arrow")) |> DataFrame
    println(Tables.schema(a))
    b = select(a, :IndividualId, :ResultDate)
    println(Tables.schema(b))
    Avro.writetable(joinpath(path, "$(file).avro"), b; compress=:zstd)
end

Output:

Tables.Schema:
 :IndividualId  Int32
 :ResultDate    Union{Missing, Date}
 :HIVResult     Union{Missing, String}
Tables.Schema:
 :IndividualId  Int32
 :ResultDate    Union{Missing, Date}
"D:\\Data\\Demography\\AHRI\\Staging\\HIVResults.avro"

How do I resolve this problem if I need a string column in my output file?

write fails on large dataframes

hi :)

As indicated in the Tables.jl docs, for very wide dataframes (>10k columns), `Tables.Schema` stores its values in the
`.storednames` and `.storedtypes` fields instead of the type parameters. This causes e.g. `Avro.writetable` to crash because the `names`/`types` type parameters are `nothing`.

signed(UInt8) is not defined until Julia 1.5

We should probably either raise the project's minimum Julia version to 1.5 or work around the issue that is making all the CI runs fail.

https://github.com/JuliaData/Avro.jl/blob/main/src/types/binary.jl#L151

   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.3.1 (2019-12-30)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> signed(UInt8)
ERROR: MethodError: Cannot `convert` an object of type Type{UInt8} to an object of type Signed
Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  convert(::Type{T}, ::Ptr) where T<:Integer at pointer.jl:23
  ...
Stacktrace:
 [1] signed(::Type) at ./int.jl:170
 [2] top-level scope at REPL[1]:1
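
One possible workaround on older Julia versions, sketched under the assumption that only the fixed-width integer types are needed, is a small lookup defined through dispatch. `signedtype` is a hypothetical name chosen for illustration; it is not something Avro.jl defines, and the actual fix in the package may look different.

```julia
# Hypothetical helper: map an integer type to its signed counterpart
# without relying on `signed(::Type)`, which needs Julia 1.5 or newer.
signedtype(::Type{UInt8})   = Int8
signedtype(::Type{UInt16})  = Int16
signedtype(::Type{UInt32})  = Int32
signedtype(::Type{UInt64})  = Int64
signedtype(::Type{UInt128}) = Int128

# Signed types map to themselves.
signedtype(::Type{T}) where {T<:Signed} = T

signedtype(UInt8)  # Int8
signedtype(Int32)  # Int32
```

Because this only uses method dispatch on types, it should behave the same on Julia 1.3 as on current releases.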

Error writing empty nested arrays

I was getting assertion errors while writing some data to disk and have reduced it to this minimal repro:

julia> Avro.write([Bool[]])
ERROR: AssertionError: nb == pos - startpos
Stacktrace:
 [1] writevalue(B::Avro.Binary, A::Avro.ArrayType, x::Vector{Vector{Bool}}, buf::Vector{UInt8}, pos::Int64, len::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Avro ~/.julia/packages/Avro/JEoRa/src/types/arrays.jl:44
 [2] write(obj::Vector{Vector{Bool}}; schema::Avro.ArrayType, jsonencoding::Bool, kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:26
 [3] write(obj::Vector{Vector{Bool}})
   @ Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:22
 [4] top-level scope
   @ REPL[31]:1

Something appears to be going wrong with empty arrays (but only if they are nested inside of other arrays).

Julia 1.6.2, Avro 1.0.0.
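
For comparison, the Avro specification encodes an array as a series of blocks, each a zigzag varint item count followed by the items, and ends the array with a zero count. The sketch below follows only that simplest form (it leaves out the optional negative-count form that also carries a block byte size) to show which bytes `[Bool[]]` would be expected to produce. `encode_long` and `encode` are illustrative names, not Avro.jl's API, and this is not how the package implements writing.

```julia
# Zigzag-map a signed integer, then emit it as a base-128 varint.
function encode_long(n::Integer)
    x = Int64(n)
    z = ((x << 1) ⊻ (x >> 63)) % UInt64
    bytes = UInt8[]
    while z >= 0x80
        push!(bytes, (z % UInt8) | 0x80)  # low 7 bits plus continuation flag
        z >>= 7
    end
    push!(bytes, z % UInt8)
    return bytes
end

# A boolean is a single byte, 0 or 1.
encode(x::Bool) = UInt8[x ? 0x01 : 0x00]

# An array is one block (count, then items) when non-empty,
# always followed by a zero count that terminates the array.
function encode(xs::AbstractVector)
    bytes = UInt8[]
    if !isempty(xs)
        append!(bytes, encode_long(length(xs)))
        for x in xs
            append!(bytes, encode(x))
        end
    end
    push!(bytes, 0x00)
    return bytes
end

encode(Bool[])    # UInt8[0x00]
encode([Bool[]])  # UInt8[0x02, 0x00, 0x00]
```

In this form an empty inner array contributes exactly one byte, so any size bookkeeping that assumes zero bytes for an empty array would come up one short.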

Is this package still maintained?

This package hasn't seen any updates in 2 years. Is it still working in 2023?

@quinnj it seems you have been responsible for most of the good stuff that has been done around here; will you please answer?

Error writing maps with exactly 64 entries

It took me a while to figure out this particular problem. Even after #10 I still have trouble writing a small subset of our data, and I was surprised to narrow it down to dictionaries with exactly 64 elements. Here's a minimal example:

julia> Avro.write([Dict("$i" => i for i in 1:64)])
ERROR: AssertionError: nb == pos - startpos
Stacktrace:
 [1] writevalue(B::Avro.Binary, A::Avro.ArrayType, x::Vector{Dict{String, Int64}}, buf::Vector{UInt8}, pos::Int64, len::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Avro ~/.julia/dev/Avro/src/types/arrays.jl:44
 [2] write(obj::Vector{Dict{String, Int64}}; schema::Avro.ArrayType, jsonencoding::Bool, kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Avro ~/.julia/dev/Avro/src/types/binary.jl:26
 [3] write(obj::Vector{Dict{String, Int64}})
   @ Avro ~/.julia/dev/Avro/src/types/binary.jl:22
 [4] top-level scope
   @ REPL[51]:1

It works fine with 63 or 65 entries. As in #9, the number of bytes written does not match `nbytes`. The value of 64 is suggestive, but I don't see an obvious problem with it:

julia> Avro.nbytes(64)
2

julia> Avro.write(64)
2-element Vector{UInt8}:
 0x80
 0x01

julia> Avro.read([0x80, 0x01], Int)
64

Searching all lengths exhaustively over 0:2^15, I find the same issue at 8192. This is another point where the encoded integer grows by a byte: [0x80, 0x80, 0x01]. I get the same error at 2^20, so there is a pattern here. Does this suggest anything to you, @quinnj?
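
The pattern matches the points where Avro's zigzag varint encoding of a long needs one more byte. Here is a minimal standalone sketch of that encoding, written from the Avro specification rather than taken from Avro.jl; `zigzag` and `encode_long` are illustrative names.

```julia
# Zigzag-map a signed 64-bit integer to unsigned so small magnitudes stay small.
zigzag(n::Int64) = ((n << 1) ⊻ (n >> 63)) % UInt64

# Emit the zigzag value as a base-128 varint, least significant group first.
function encode_long(n::Integer)
    z = zigzag(Int64(n))
    bytes = UInt8[]
    while z >= 0x80
        push!(bytes, (z % UInt8) | 0x80)  # low 7 bits plus continuation flag
        z >>= 7
    end
    push!(bytes, z % UInt8)
    return bytes
end

encode_long(64)    # UInt8[0x80, 0x01]
encode_long(8192)  # UInt8[0x80, 0x80, 0x01]

# Lengths in 1:2^15 whose encoding is longer than that of the previous length.
[n for n in 1:2^15 if length(encode_long(n)) > length(encode_long(n - 1))]  # [64, 8192]
```

Each extra byte adds 7 bits and zigzag spends one bit on the sign, so the boundaries fall at 2^6, 2^13, 2^20, and so on. That is consistent with the failures at 64, 8192, and 2^20, and it points at code that sizes the entry count before writing it.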

Read and write avro one record at a time

I have a use case where the file I would be writing to or reading from is too large to fit in memory. However, I only need to access one record at a time.

Is there currently any way to ensure that only one record is in memory at a time, similar to `CSV.Rows`?
Is there a way to write to the Avro file row by row?

Data is always written without compression

Experimenting with the library, it seems that the data is compressed during writing, but the original, uncompressed data is written out (tables.jl line 106 writes bytes instead of finalbytes).
