JuliaData / Avro.jl
Pure Julia implementation for reading/writing data in the Avro format
License: MIT License
Hi @quinnj,
Just started using this great package - seems perfect for our use! Thanks.
I was curious why the default schematype for DateTime is in microseconds? It would seem more natural to export milliseconds by default, given that is the precision used in Julia.
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!
I ran into an issue reading one of my own. I think I debugged it, but it would be good, for the sake of avoiding regressions, to test for this going forward.
Maybe generate some examples with pyavro.
When I try to convert an Arrow file to Avro using this code:
using Arrow, Avro, DataFrames, Tables

function t_arrowtoavro(path, file)
    a = Arrow.Table(joinpath(path, "$(file).arrow")) |> DataFrame
    println(Tables.schema(a))
    Avro.writetable(joinpath(path, "$(file).avro"), a; compress=:zstd)
end
I get the following error whenever the input file contains a string column, e.g.
Tables.Schema:
:IndividualId Int32
:ResultDate Union{Missing, Date}
:HIVResult Union{Missing, String}
ERROR: ArgumentError: internal writing error: buffer too small, len = 1563130
Whereas this version does not produce this error:
function t_arrowtoavro(path, file)
    a = Arrow.Table(joinpath(path, "$(file).arrow")) |> DataFrame
    println(Tables.schema(a))
    b = select(a, :IndividualId, :ResultDate)
    println(Tables.schema(b))
    Avro.writetable(joinpath(path, "$(file).avro"), b; compress=:zstd)
end
Output:
Tables.Schema:
:IndividualId Int32
:ResultDate Union{Missing, Date}
:HIVResult Union{Missing, String}
Tables.Schema:
:IndividualId Int32
:ResultDate Union{Missing, Date}
"D:\\Data\\Demography\\AHRI\\Staging\\HIVResults.avro"
How do I resolve this problem if I need a string column in my output file?
Hi :)
As indicated in the Tables.jl docs, for large dataframes (>10k columns), Tables.Schema stores its values in .storednames and .storedtypes instead of the type parameters. This causes e.g. Avro.writetable to crash, because the names/types type parameters are nothing.
We should probably either move up the project dependency to Julia 1.5 or else work around the issue that is making all the CI fail:
https://github.com/JuliaData/Avro.jl/blob/main/src/types/binary.jl#L151
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.3.1 (2019-12-30)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> signed(UInt8)
ERROR: MethodError: Cannot `convert` an object of type Type{UInt8} to an object of type Signed
Closest candidates are:
convert(::Type{T}, ::T) where T<:Number at number.jl:6
convert(::Type{T}, ::Number) where T<:Number at number.jl:7
convert(::Type{T}, ::Ptr) where T<:Integer at pointer.jl:23
...
Stacktrace:
[1] signed(::Type) at ./int.jl:170
[2] top-level scope at REPL[1]:1
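For what it's worth, one possible workaround on older Julia versions is sketched below. It relies only on `signed` applied to a value, which the session above suggests is the form available on 1.3 (calling it on a type appears to need Julia 1.5, as far as I can tell). The helper name `signedtype` is made up for this illustration and is not part of Avro.jl or Base.

```julia
# Hypothetical helper (not from Avro.jl or Base): find the signed counterpart
# of an integer type by converting a zero *value* instead of calling
# `signed` on the type itself.
signedtype(::Type{T}) where {T<:Integer} = typeof(signed(zero(T)))

println(signedtype(UInt8))
println(signedtype(UInt64))
```

Signed types pass through unchanged, so the helper could be applied without first checking whether the input is unsigned.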
I was getting assertion errors while writing some data to disk and have reduced it to this minimal repro:
julia> Avro.write([Bool[]])
ERROR: AssertionError: nb == pos - startpos
Stacktrace:
[1] writevalue(B::Avro.Binary, A::Avro.ArrayType, x::Vector{Vector{Bool}}, buf::Vector{UInt8}, pos::Int64, len::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Avro ~/.julia/packages/Avro/JEoRa/src/types/arrays.jl:44
[2] write(obj::Vector{Vector{Bool}}; schema::Avro.ArrayType, jsonencoding::Bool, kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:26
[3] write(obj::Vector{Vector{Bool}})
@ Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:22
[4] top-level scope
@ REPL[31]:1
Something appears to be going wrong with empty arrays (but only if they are nested inside of other arrays).
Julia 1.6.2, Avro 1.0.0.
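For reference when writing a regression test, here is a rough sketch of what the Avro specification says the binary form should be: an array is a series of blocks, each a `long` count followed by that many items, closed by a block with count zero. This is written from the spec, not taken from Avro.jl's internals, and `encode_long` / `encode_value` are names invented for the sketch.

```julia
# Avro `long`: zigzag-map the value, then emit 7 bits per byte, low bits
# first, setting the high bit on every byte except the last.
function encode_long(n::Integer)
    v = Int64(n)
    z = reinterpret(UInt64, xor(v << 1, v >> 63))
    bytes = UInt8[]
    while z >= 0x80
        push!(bytes, UInt8(z & 0x7f) | 0x80)
        z >>= 7
    end
    push!(bytes, UInt8(z))
    return bytes
end

# Avro boolean: a single byte, 0 or 1.
encode_value(x::Bool) = UInt8[x]

# Avro array: one block holding all items (if any), then a zero-count block.
function encode_value(xs::AbstractVector)
    bytes = UInt8[]
    if !isempty(xs)
        append!(bytes, encode_long(length(xs)))
        for x in xs
            append!(bytes, encode_value(x))
        end
    end
    append!(bytes, encode_long(0))  # zero count marks the end of the array
    return bytes
end

println(encode_value([Bool[]]))
```

Under this reading of the spec, `[Bool[]]` comes out as three bytes: the outer count of 1 (`0x02` after zigzag), the inner empty array (`0x00`), and the outer terminator (`0x00`).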
This package hasn't seen any updates in 2 years. Is it still working in 2023?
@quinnj it seems you have been responsible for most of the good stuff that has been done around here; will you please answer?
It took me a while to figure out this particular problem. Even after #10 I still have trouble writing a small subset of our data, and I was surprised to narrow it down to dictionaries with exactly 64 elements. Here's a minimal example:
julia> Avro.write([Dict("$i" => i for i in 1:64)])
ERROR: AssertionError: nb == pos - startpos
Stacktrace:
[1] writevalue(B::Avro.Binary, A::Avro.ArrayType, x::Vector{Dict{String, Int64}}, buf::Vector{UInt8}, pos::Int64, len::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Avro ~/.julia/dev/Avro/src/types/arrays.jl:44
[2] write(obj::Vector{Dict{String, Int64}}; schema::Avro.ArrayType, jsonencoding::Bool, kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Avro ~/.julia/dev/Avro/src/types/binary.jl:26
[3] write(obj::Vector{Dict{String, Int64}})
@ Avro ~/.julia/dev/Avro/src/types/binary.jl:22
[4] top-level scope
@ REPL[51]:1
It works fine with 63 or 65 entries. Like #9, the number of bytes written does not match nbytes. The value of 64 is suggestive, but I don't see an obvious problem with that:
julia> Avro.nbytes(64)
2
julia> Avro.write(64)
2-element Vector{UInt8}:
0x80
0x01
julia> Avro.read([0x80, 0x01], Int)
64
Searching all lengths exhaustively from 0:2^15, I find the same issue at 8192. This is another point where the integer adds a byte, [0x80, 0x80, 0x01]. I get the same error at 2^20, so there is a pattern here. Does this suggest anything to you, @quinnj?
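To make the pattern concrete, here is a small standalone sketch of the zigzag varint encoding that the Avro spec uses for `long` values, together with a scan for the lengths at which the encoding grows by a byte. It is independent of Avro.jl; `zigzag` and `encode_long` are names chosen for this illustration only.

```julia
# Zigzag maps signed to unsigned so that small magnitudes stay small:
# 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
zigzag(n::Int64) = reinterpret(UInt64, xor(n << 1, n >> 63))

# Variable-length encoding: 7 payload bits per byte, low bits first,
# with the high bit set on every byte except the last.
function encode_long(n::Integer)
    z = zigzag(Int64(n))
    bytes = UInt8[]
    while z >= 0x80
        push!(bytes, UInt8(z & 0x7f) | 0x80)
        z >>= 7
    end
    push!(bytes, UInt8(z))
    return bytes
end

# Lengths in 1:2^15 whose encoding is one byte longer than the previous one.
boundaries = [n for n in 1:2^15 if length(encode_long(n)) > length(encode_long(n - 1))]
println(boundaries)
```

The scan prints `[64, 8192]`, and `encode_long(2^20)` is four bytes long, so the failing sizes line up with the points where the zigzag value crosses a 7-bit boundary (2^6, 2^13, 2^20, ...).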
I have a use case where the file I would be writing to or reading from is too large to fit in memory. However, I only need to access one record at a time.
Is there currently any way to ensure that only one record is in memory at any time, similar to CSV.Rows?
Is there a way to write to the Avro file row by row?
Experimenting with the library, it seems that the data is compressed during writing, but the original, uncompressed data is written out (tables.jl line 106 writes bytes instead of finalbytes).