LightBSON

High performance encoding and decoding of BSON data.

What It Is

  • Allocation free API for reading and writing BSON data.
  • Natural mapping of Julia types to corresponding BSON types.
  • Convenience API to read and write Dict{String, Any} or LittleDict{String, Any} (default, for roundtrip consistency) as BSON.
  • Struct API tunable for tradeoffs between flexibility, performance, and evolution.
  • Configurable validation levels.
  • Light weight indexing for larger documents.
  • Transducers.jl compatible.
  • Tested for conformance against the BSON corpus.

What It Is Not

  • Generic serialization of all Julia types to BSON. See BSON.jl for that functionality. LightBSON.jl aims for natural representations, suitable for interop with other languages and long term persistence.
  • Integrated with FileIO.jl. BSON.jl already is, and adding another with different semantics would be confusing.
  • A BSON mutation API. Reading and writing are entirely separate and only complete documents can be written.
  • Conversion to and from Extended JSON. This may be added later.

High Level API

Performance sensitive use cases are the focus of this package, but a high level API is available for cases where performance is less of a concern. The high level API also performs optimally when reading simple types without strings, arrays, or other fields that require allocation.

  • bson_read([T=Any], src) reads T from src where src is either a DenseVector{UInt8}, an IO object, or a path to a file.
  • bson_write(dst, x) writes x to dst where x is a type that maps to fields (dictionary, struct, named tuple), and dst is a DenseVector{UInt8}, an IO, or a path to a file.
  • bson_write(dst, xs...) writes xs to dst where xs is a list of Pair{String, T} mapping names to field values, and dst is as above.
# Write to a byte buffer and return the buffer
buf = bson_write(UInt8[], :x => Int64(1), :y => Int64(2))
# Read from a byte buffer as a LittleDict{String, Any}
bson_read(buf)
# Read from a byte buffer as a specific type
x = bson_read(@NamedTuple{x::Int64, y::Int64}, buf)
path = joinpath(homedir(), "test.bson")
# Write to a file
bson_write(path, x)
# Read from a file as a LittleDict{String, Any}
bson_read(path)

Low Level API

  • Documents are read and written to and from byte arrays with BSONReader and BSONWriter.
  • BSONReader and BSONWriter are immutable struct types with no state. They can be instantiated without allocation.
  • BSONWriter will append to the destination array. User is responsible for not writing duplicate fields.
  • reader["foo"] or reader[:foo] finds foo and returns a new reader pointing to the field.
  • reader[T] materializes a field to the type T.
  • writer["foo"] = x or writer[:foo] = x appends a field with name foo and value x.
  • close(writer) finalizes a document (writes the document length and terminating null byte).
  • Prefer strings for field names. Symbols in Julia unfortunately do not have a constant time length API.

Example

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = Int64(123)
close(writer)
reader = BSONReader(buf)
reader["x"][Int64] # 123
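
To make the effect of close(writer) concrete, the following minimal sketch inspects the raw bytes of the document written above. It relies only on the BSON wire format (a little-endian Int32 total length at the start, a null byte at the end) and assumes the writer started from an empty buffer; the variable name doc_len is illustrative.

```julia
using LightBSON

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = Int64(123)
close(writer)

# A BSON document begins with its total length as a little-endian Int32
doc_len = ltoh(reinterpret(Int32, buf[1:4])[1])

# 4 (length) + 1 (type tag) + 2 ("x" plus null) + 8 (Int64) + 1 (terminator)
doc_len == length(buf) # true, both are 16
buf[end] == 0x00       # true, the terminating null byte
```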

Nested Documents

Nested documents are written by assigning a field with a function that takes a nested writer. Nested fields are read by index.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = nested_writer -> begin
    nested_writer["a"] = 1
    nested_writer["b"] = 2
end
close(writer)
reader = BSONReader(buf)
reader["x"]["a"][Int64] # 1
reader["x"]["b"][Int64] # 2

Arrays

Arrays are written by assigning array values, or by generators. Arrays can be materialized to Julia arrays, or be accessed by index.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = Int64[1, 2, 3]
writer["y"] = (Int64(x * x) for x in 1:3)
close(writer)
reader = BSONReader(buf)
reader["x"][Vector{Int64}] # [1, 2, 3]
reader["y"][Vector{Int64}] # [1, 4, 9]
reader["x"][2][Int64] # 2
reader["y"][2][Int64] # 4

Dict and Any

Where performance is not a concern, elements can be materialized to dictionaries (recursively) or to the most appropriate Julia type for leaf fields. Dictionaries can also be written directly.

buf = UInt8[]
writer = BSONWriter(buf)
writer[] = LittleDict{String, Any}("x" => Int64(1), "y" => "foo")
close(writer)
reader = BSONReader(buf)
reader[Dict{String, Any}] # Dict{String, Any}("x" => 1, "y" => "foo")
reader[LittleDict{String, Any}] # LittleDict{String, Any}("x" => 1, "y" => "foo")
reader[Any] # LittleDict{String, Any}("x" => 1, "y" => "foo")
reader["x"][Any] # 1
reader["y"][Any] # "foo"

Arrays With Nested Documents

Generators can be used to directly write nested documents in arrays.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = (
    nested_writer -> begin
        nested_writer["a"] = Int64(x)
        nested_writer["b"] = Int64(x * x)
    end
    for x in 1:3
)
close(writer)
reader = BSONReader(buf)
reader["x"][Vector{Any}] # Any[LittleDict{String, Any}("a" => 1, "b" => 1), LittleDict{String, Any}("a" => 2, "b" => 4), LittleDict{String, Any}("a" => 3, "b" => 9)]
reader["x"][3]["b"][Int64] # 9

Read Abstract Types

The abstract types Number, Integer, and AbstractFloat can be used to materialize numeric fields to the most appropriate Julia type under the abstract type.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = Int32(123)
writer["y"] = 1.25
close(writer)
reader = BSONReader(buf)
reader["x"][Number] # 123
reader["x"][Integer] # 123
reader["y"][Number] # 1.25
reader["y"][AbstractFloat] # 1.25

Unsafe String

String fields can be materialized as WeakRefString{UInt8}, aliased as UnsafeBSONString. This will create a string object with pointers to the underlying buffer without performing any allocations. User must take care of GC safety.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = "foo"
close(writer)
reader = BSONReader(buf)
s = reader["x"][UnsafeBSONString]
GC.@preserve buf s == "foo" # true

Binary

Binary fields are represented with BSONBinary.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = BSONBinary([0x1, 0x2, 0x3], BSON_SUBTYPE_GENERIC_BINARY)
close(writer)
reader = BSONReader(buf)
x = reader["x"][BSONBinary]
x.data # [0x1, 0x2, 0x3]
x.subtype # BSON_SUBTYPE_GENERIC_BINARY

Unsafe Binary

Binary fields can be materialized as UnsafeBSONBinary for zero allocations, where the data field is an UnsafeArray{UInt8, 1} with pointers into the underlying buffer. User must take care of GC safety.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = BSONBinary([0x1, 0x2, 0x3], BSON_SUBTYPE_GENERIC_BINARY)
close(writer)
reader = BSONReader(buf)
x = reader["x"][UnsafeBSONBinary]
GC.@preserve buf x.data == [0x1, 0x2, 0x3] # true
x.subtype # BSON_SUBTYPE_GENERIC_BINARY

Iteration

Fields can be iterated with foreach() or using the Transducers.jl APIs. Fields are represented as Pair{UnsafeBSONString, BSONReader}.

using Transducers # provides Map
buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = Int64(1)
writer["y"] = Int64(2)
writer["z"] = Int64(3)
close(writer)
reader = BSONReader(buf)
reader |> Map(x -> x.second[Int64]) |> sum # 6
foreach(x -> println(x.second[Int64]), reader) # 1\n2\n3\n

Indexing

BSON field access involves a linear scan to find the matching field name. Depending on the size and structure of a document, and the fields being accessed, it might be preferable to build an index over the fields first, to be re-used on every access.

BSONIndex provides a very light weight partial (collisions evict previous entries) index over a document. It is designed to be re-used from document to document, by means of a constant time reset. IndexedBSONReader wraps a reader and an index to accelerate field access in a document. Index misses fall back to the wrapped reader.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = Int64(1)
writer["y"] = Int64(2)
writer["z"] = Int64(3)
close(writer)
index = BSONIndex(128)
# Index is built when IndexedBSONReader is constructed
reader = IndexedBSONReader(index, BSONReader(buf))
reader["z"][Int64] # 3 -- accessed by index

empty!(buf)
writer = BSONWriter(buf)
writer["a"] = Int64(1)
writer["b"] = Int64(2)
writer["c"] = Int64(3)
close(writer)
# Index can be re-used
reader = IndexedBSONReader(index, BSONReader(buf))
reader["b"][Int64] # 2 -- accessed by index
reader["x"] # throws KeyError

Validation

BSONReader can be configured with a validator to use while processing the input document. Three are provided:

BSONReader(buf, StrictBSONValidator()) # Reader with strict validation
BSONReader(buf, LightBSONValidator())  # Reader with memory protected validation
BSONReader(buf, UncheckedBSONValidator()) # Reader with no validation
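
As a small sketch of how the validator choice fits into a read, the snippet below reads the same well-formed document with each of the three validators. It assumes, as seems reasonable for valid input, that all three return the same value; the validators differ in how they treat malformed input, which is not shown here.

```julia
using LightBSON

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = Int64(123)
close(writer)

# Same document, three validation levels
strict = BSONReader(buf, StrictBSONValidator())["x"][Int64]
light = BSONReader(buf, LightBSONValidator())["x"][Int64]
unchecked = BSONReader(buf, UncheckedBSONValidator())["x"][Int64]

strict == light == unchecked # true for this well-formed document
```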

Structs

Structs can be automatically translated to and from BSON, provided all their fields can be represented in BSON. Trait functions are used to select the mode of conversion, and these serve as an extension point for user defined types.

  • bson_simple(T)::Bool - Set this to true if fields to be serialized are given by fieldnames(T) and T can be constructed by fields in order of declaration. Defaults to StructTypes.StructType(T) == StructTypes.NoStructType().

Generic

Provided bson_simple(T) is false, serialization will use the StructTypes.jl API to iterate fields of T and to construct T. See StructTypes.jl for more details.
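
A minimal sketch of the generic path, assuming a hypothetical type named Generic: declaring a StructTypes.StructType for it means bson_simple(Generic) defaults to false, so reading and writing should go through the StructTypes.jl API.

```julia
using LightBSON, StructTypes

# Hypothetical example type, not part of the package
struct Generic
    x::String
    y::Int64
end

# With a StructType declared, bson_simple(Generic) defaults to false
StructTypes.StructType(::Type{Generic}) = StructTypes.Struct()

buf = UInt8[]
writer = BSONWriter(buf)
writer["g"] = Generic("foo", 123)
close(writer)
reader = BSONReader(buf)
g = reader["g"][Generic]
```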

Simple

For simple types, enabling the bson_simple(T) trait generates faster serialization code. It only needs to be set explicitly if StructTypes.StructType(T) is also defined.

using StructTypes

struct NestedSimple
    a::Int64
    b::Float64
end

struct Simple
    x::String
    y::NestedSimple
end

StructTypes.StructType(::Type{Simple}) = StructTypes.Struct()
LightBSON.bson_simple(::Type{Simple}) = true

buf = UInt8[]
writer = BSONWriter(buf)
writer["simple"] = Simple("foo", NestedSimple(123, 1.25))
close(writer)
reader = BSONReader(buf)
reader["simple"][Simple] # Simple("foo", NestedSimple(123, 1.25))

# Structs can also be written to the root of the document
buf = UInt8[]
writer = BSONWriter(buf)
writer[] = Simple("foo", NestedSimple(123, 1.25))
close(writer)
reader = BSONReader(buf)
reader[Simple] # Simple("foo", NestedSimple(123, 1.25))

Schema Evolution

For long term persistence or long lived APIs, it may be advisable to encode information about schema versions in documents, and implement ways to evolve schemas through time. Specific strategies for schema evolution are beyond the scope of this package to advise or impose. Instead, extension points are provided for users to implement the mechanisms best fit to their use cases.

  • bson_schema_version(T) - The current schema version for T, can be any BSON compatible type, or nothing if T is unversioned. Defaults to nothing.
  • bson_schema_version_field(T) - The field name to use in the BSON document for storing the schema version. Defaults to "_v".
  • bson_read_versioned(T, v, reader) - Handle version v with respect to current version of T and read T from reader. Defaults to error if the schema versions are mismatched, and otherwise read as-if unversioned.
struct Evolving1
    x::Int64
end

LightBSON.bson_schema_version(::Type{Evolving1}) = Int32(1)

struct Evolving2
    x::Int64
    y::Float64
end

# Construct from old version, defaulting new fields
Evolving2(old::Evolving1) = Evolving2(old.x, NaN)

LightBSON.bson_schema_version(::Type{Evolving2}) = Int32(2)

function LightBSON.bson_read_versioned(::Type{Evolving2}, v::Int32, reader::AbstractBSONReader)
    if v == 1
        Evolving2(bson_read_unversioned(Evolving1, reader))
    elseif v == 2
        bson_read_unversioned(Evolving2, reader)
    else
        # Real world application may instead want a mechanism to allow forward compatibility,
        # e.g., by encoding breaking vs non-breaking change info in the version
        error("Unsupported schema version $v for Evolving")
    end
end

const Evolving = Evolving2

buf = UInt8[]
writer = BSONWriter(buf)
# Write old version
writer[] = Evolving1(123)
close(writer)
reader = BSONReader(buf)
# Read as new version
reader[Evolving] # Evolving2(123, NaN)

Named Tuples

Named tuples can be read and written like any struct type.

buf = UInt8[]
writer = BSONWriter(buf)
writer[] = (; x = "foo", y = 1.25, z = (; a = Int64(123), b = Int64(456)))
close(writer)
reader = BSONReader(buf)
reader[@NamedTuple{x::String, y::Float64, z::@NamedTuple{a::Int64, b::Int64}}] # (x = "foo", y = 1.25, z = (a = 123, b = 456))

Faster Buffer

Since BSONWriter itself is immutable, it makes frequent calls to resize the underlying array to track the write head position. Unfortunately at present, this is not a well optimized operation in Julia, resolving to C-calls for manipulating the state of the array.

BSONWriteBuffer wraps a Vector{UInt8} to track size purely in Julia and avoid most of these calls. It implements the minimum API necessary for use with BSONReader and BSONWriter, and is not for use as a general Array implementation.

buf = BSONWriteBuffer()
writer = BSONWriter(buf)
writer["x"] = Int64(123)
close(writer)
reader = BSONReader(buf)
reader["x"][Int64] # 123
buf.data # Underlying array, may be longer than length(buf)

Performance

Performance naturally will depend very much on the nature of data being processed. The main overarching goal with this package is to enable the highest possible performance where the user requires it and is willing to sacrifice some convenience to achieve their target.

General advice for high performance BSON schema, such as short field names, avoiding long arrays or documents, and using nesting to reduce search complexity, all apply.

For LightBSON.jl specifically, prefer strings over symbols for field names, use unsafe variants rather than allocating strings and buffers where possible, reuse buffers and indexes, use BSONWriteBuffer rather than plain Vector{UInt8}, and enable bson_simple(T) for all applicable types.
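
As an illustration of the buffer reuse advice, the sketch below writes several documents into one BSONWriteBuffer, clearing it with empty! before each write. The loop and the field name i are illustrative, not taken from the package.

```julia
using LightBSON

buf = BSONWriteBuffer()
values = Int64[]
for i in 1:3
    # Reset the buffer instead of allocating a new one per document
    empty!(buf)
    writer = BSONWriter(buf)
    writer["i"] = Int64(i)
    close(writer)
    # Read back before the buffer is reused for the next document
    push!(values, BSONReader(buf)["i"][Int64])
end
values # [1, 2, 3]
```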

Here's an example benchmark, reading and writing a named tuple with nesting (run on a Ryzen 5950X).

using BenchmarkTools, Dates, DecFP, UUIDs

x = (;
    f1 = 1.25,
    f2 = Int64(123),
    f3 = now(UTC),
    f4 = true,
    f5 = Int32(456),
    f6 = d128"1.2",
    f7 = (; x = uuid4(), y = 2.5)
)

@btime begin
    buf = $(UInt8[])
    empty!(buf)
    writer = BSONWriter(buf)
    writer[] = $x
    close(writer)
end # 78.730 ns (0 allocations: 0 bytes)

@btime begin
    buf = $(BSONWriteBuffer())
    empty!(buf)
    writer = BSONWriter(buf)
    writer[] = $x
    close(writer)
end # 58.303 ns (0 allocations: 0 bytes)

buf = UInt8[]
writer = BSONWriter(buf)
writer[] = x
close(writer)
@btime BSONReader($buf)[$(typeof(x))] # 103.825 ns (0 allocations: 0 bytes)
@btime BSONReader($buf, UncheckedBSONValidator())[$(typeof(x))] # 102.549 ns (0 allocations: 0 bytes)
@btime IndexedBSONReader($(BSONIndex(128)), BSONReader($buf))[$(typeof(x))] # 86.876 ns (0 allocations: 0 bytes)
@btime IndexedBSONReader($(BSONIndex(128)), BSONReader($buf, UncheckedBSONValidator()))[$(typeof(x))] # 84.987 ns (0 allocations: 0 bytes)
@btime IndexedBSONReader($(BSONIndex(128)), BSONReader($buf)) # 47.896 ns (0 allocations: 0 bytes)
@btime $(IndexedBSONReader(BSONIndex(128), BSONReader(buf)))[$(typeof(x))] # 41.434 ns (0 allocations: 0 bytes)

We can observe BSONWriteBuffer makes a material difference to write performance, while indexing, in this case, has a reasonable effect on read performance even for this small document. In the final two lines we can see that indexing and reading the indexed document breaks down roughly half/half. Using the unchecked validator has a smaller impact, and must be used with caution.

ObjectId

BSONObjectId implements the BSON ObjectId type in accordance with the spec. New IDs for the process can be generated by construction, or using bson_object_id_range(n) for more efficient batch generation.

id1 = BSONObjectId()
DateTime(id1) # The UTC timestamp for when id1 was generated, truncated to seconds
id2 = BSONObjectId()
id1 != id2 # true, IDs are unique
collect(bson_object_id_range(5)) # 5 IDs with equal timestamp and different sequence field

Related Packages

Contributors

ancapdev, dantaras, green-nsk, poncito

lightbson.jl's Issues

Support big endian arch

Not hard to implement, but need a system to test on and github hosted runners are only x86

UInt32, UInt64 and Float32

Hi Christian! Hope you're doing well :) I have been playing around with your library and noticed it doesn't support some of the basic primitives. This is somewhat related to #8 and I imagine the reason for this is that they are not strictly speaking part of the bson spec. I also imagine that the issue is that during deserialization to a target type T you'd need to search for all possible source primitive types that map to the encoded type? say you can store a uint32 into a "timestamp" but then when decoding a "timestamp" you'd need to really see if the target field is uint32, uint64 or timestamp? Do you think that adding support for all "standard" primitives is out of the scope of your library?

Remove bson_supersimple

This optimization can be checked and implemented for all qualifying bson_simple(T) types in bson_write_simple().

Support conversions in tuples

Conversions API currently doesn't look inside tuples.

using LightBSON

struct X
    x::String
end

function Base.setindex!(writer::BSONWriter, value::X, index::Union{String, Symbol})
    GC.@preserve value setindex!(writer, UnsafeBSONString(pointer(value.x), length(value.x)), index)
end

LightBSON.bson_representation_type(::Type{X}) = UnsafeBSONString

buf = UInt8[]; writer = BSONWriter(buf); writer["x"] = X("foo"); close(writer)
reader = BSONReader(buf); reader["x"][X]

However this throws:

buf = UInt8[]; writer = BSONWriter(buf); writer["x"] = (X("foo"),); close(writer)
reader = BSONReader(buf); reader["x"][Tuple{X}]
ERROR: MethodError: Cannot `convert` an object of type String to an object of type X
Closest candidates are:
  convert(::Type{T}, ::T) where T at ~/distr/julia-1.7.3/share/julia/base/essentials.jl:218
  X(::String) at ~/code/scratch.jl:4
  X(::Any) at ~/code/scratch.jl:4
Stacktrace:
 [1] _totuple
   @ ./tuple.jl:330 [inlined]
 [2] Tuple{X}(itr::Vector{Any})
   @ Base ./tuple.jl:317
 [3] bson_representation_convert
   @ ~/.julia/packages/LightBSON/WrSYn/src/representations.jl:14 [inlined]

Defining Base.convert(::Type{X}, ::AbstractString) helps, but it shouldn't be needed:

Base.convert(::Type{X}, v::AbstractString) = X(v)
reader = BSONReader(buf); reader["x"][Tuple{X}]

Add link to BSONify

Hi, just wondering if you could please add a link to my BSON encode/decode package in the "Generic serialization of all Julia types to BSON" line in the README?

Support for more base types?

A bit similar to #7

obj = 1:10
LightBSON.bson_write(path, obj)
LightBSON.bson_read(typeof(obj),path)

throws an error:

KeyError: key :start not found

Should ranges and other common types from Base work out of the box? Or is this out of scope?

Add conversion context

A conversion context passed to readers, writers, and down to representation_convert() and representation_type() will let users extend conversion behaviors without surprises.

Out of the box provide contexts for:

  • No conversions, strictly only BSON types.
  • Default conversions, those currently implemented with representations.
  • Unsigned integer conversions, extending the default context with unsigned integers reinterpreted as signed.

NamedTuple dispatch

Hi Christian,

I'm not quite sure about this method. I would just remove it in order to be able to benefit from the bson_write_simple optimization. Would you agree?

Then, I'm wondering why you're using this default,

bson_simple(::Type{T}) where T = StructTypes.StructType(T) == StructTypes.NoStructType()

In the case of the namedtuples, they are StructTypes.UnorderedStruct, and hence are not treated as simple bsons by default.
My expectation was that a namedtuple carries no information on its type, and hence I'd be surprised if we would ever consider them as more than just a collection of fields to serialize.

What am I missing?
Thanks,
Romain

Multi dimensional arrays?

First of all, thanks a lot for creating this package. I think it is awesome to have a less magical alternative to BSON.jl.
The following snippet gives an error:

import LightBSON
arr = rand(2,2)
path = "hello.bson"
LightBSON.bson_write(path, arr)
LightBSON.bson_read(typeof(arr), path)
MethodError: no method matching Matrix{Float64}()

I think it would be awesome if load+save of Base.Array would conveniently work out of the box.

  • Is that in scope of the package?
  • If so would it be hard to implement?
