Giter Club home page Giter Club logo

Comments (13)

bbenne10 avatar bbenne10 commented on August 21, 2024 2

An int is probably not enough, but Bitv provides serialisation functions that you can use directly (see here).

Seems I misread max_length as max_size. I'll look into how to get this done in the morning (we just hit 5:10PM EST, so I'm about to head out for the evening). Hopefully, I should have something by about this time tomorrow (provided none of my other fires gets too large to need putting out ;) )

from bloomf.

pascutto avatar pascutto commented on August 21, 2024

Hi! Thanks for your interest.
I'm not really keen to exposing internals of the library to the users, since that would enable them to break some important invariants of the filters.
Do you think writing simple {to,from}_bytes or {to,from}_string would match your use case?

from bloomf.

dinosaure avatar dinosaure commented on August 21, 2024

On other side, the Marshal module is another solution to serialize/deserialize any values.

from bloomf.

bbenne10 avatar bbenne10 commented on August 21, 2024

I'd prefer not to use Marshal, as it breaks between OCaml compiler versions which is a real problem for us.

A {to,from}_bytes or {to,from}_string COULD work.
I noticed after opening this PR that there's an open PR for saving state of the hash functions and the like, which is unimportant for me, but could VERY well be useful. Is there a reason that this hasn't been merged? Or is it just because it needs a rebase?

I completely understand not wanting to expose internals (especially as it breaks type encapsulation on the resultant type), but we'd then need to save a representation of the parametrized type so that the serialize would know how to reconstruct that type in deserialization.

A to/from bytes or string implementation could do that, but would require significant care to ensure that both sides of the operation worked properly. Saving the implementation type works more reliably (technically), but requires care from the developer to know what the type of the original instance was and only compare with that type. I'm fine with the latter, but I'm only comparing one type.

from bloomf.

pascutto avatar pascutto commented on August 21, 2024

I noticed after opening this PR that there's an open PR for saving state of the hash functions and the like, which is unimportant for me, but could VERY well be useful. Is there a reason that this hasn't been merged? Or is it just because it needs a rebase?

Indeed, I actually forgot about #1.
The same caveat applies there, exposing types and functions that break the abstraction and enable the user to change or even know about the internals is not the way to go, as the structure contains subtle invariants that should not be breakable (this is enforced by the interface).
I'm not sure what you're refering to with "the state of the hash functions", but hash functions should behave as pure functions, so there is no state to save. Other parameters would indeed have to be saved though, and that would be the responsibility of the {of,to}_bytes functions.

I completely understand not wanting to expose internals (especially as it breaks type encapsulation on the resultant type), but we'd then need to save a representation of the parametrized type so that the serialize would know how to reconstruct that type in deserialization.

The bloom filters do not contain a representation of the parametrized type. (De)serialization is by essence unsafe, so it's always the responsibility of the user/storage to ensure the data is used correctly when written somewhere else, regardless of the serialization strategy.

from bloomf.

bbenne10 avatar bbenne10 commented on August 21, 2024

I'm not sure what you're refering to with "the state of the hash functions"

Seeds, unless I misread the PR.

The bloom filters do not contain a representation of the parametrized type.

Of course not - that is exactly the problem.
The question is "Given that the storage is trusted and that type definitions of the parametrized type haven't changed, how do we allow the user to serialize both a reference to the parametrized type AND the underlying implementation so that de-serialization is possible?"

Refinements could be made to that question in regards to the parametrized type's definition evolving or something, if you'd like, but this works for me for now.

How do we approach that problem? As mentioned, I'm new to the OCaml world, but quite well versed in the OO world. My first thought is reflection or introspection, but that requires some level of runtime type-checking and type preservation (which - to my understanding - OCaml does away with during compilation). How would you approach this problem? I'm happy to explore and attempt an implementation, but don't want to waste anyone's time with efforts that won't be accepted upstream.

(De)serialization is by essence unsafe, so it's always the responsibility of the user/storage to ensure the data is used correctly when written somewhere else, regardless of the serialization strategy.

Of course. But many of the oft-cited problems with serialization and deserialization fall outside the responsibility (or even the ability) of a library such as this to encapsulate.

Dealing with changed type definitions or mutated hash functions?
Bail if you can (with an exception)
or rely on the developers to know that changes to types/hash functions invalidate serialized data.

Someone else has access to your storage drives and can muck with your data?
Hope you trust them!

There's only so much that a language can do here.

from bloomf.

craigfe avatar craigfe commented on August 21, 2024

How do we approach that problem? As mentioned, I'm new to the OCaml world, but quite well versed in the OO world. My first thought is reflection or introspection, but that requires some level of runtime type-checking and type preservation (which - to my understanding - OCaml does away with during compilation). How would you approach this problem? I'm happy to explore and attempt an implementation, but don't want to waste anyone's time with efforts that won't be accepted upstream.

Indeed: OCaml values are not tagged with their type at runtime; the assumption being that type errors are for the type checker 🙂 As a result, it's not natural in OCaml for serialised values to contain type information either: it's the users responsibility to ensure that type errors do not happen, and neither the OCaml runtime nor the bloomf library is ultimately capable of ensuring "type safety" (in the way that an OCaml programmer thinks of it).

Having said that it's not natural in OCaml, it's still possible. We can imagine an interface where the codecs require the user to pass an explicit "type representation" (like a JVM klass pointer, but well-typed):

val to_bytes : 'a type_repr -> 'a bloomf -> bytes
val of_bytes : 'a type_repr -> bytes -> 'a bloomf

(* e.g. *)
let buffer : int list bloomf = Bloomf.to_bytes Type.(list int) filter

which could then be included in the bytes representation to protect against users against creating a bloom filter at one type, passing it through a pipe, then trying to use it at another. We even have functions exactly like this in Irmin (where it's strictly required, since the codecs are derived from the runtime type representation). However:

  • the user has to build this representation by hand, so the API becomes heavy;

  • Bloomf doesn't do anything with this data other than performing an equality check later (so the user might as well build their own API on top of Bloomf that provides this run-time check);

  • it doesn't have much benefit: only adds a new sort of run-time error that the user might have better expressed at compile time with types (e.g. by tagging bytes with a phantom parameter)

In short: since dynamic typing can be built on top of static typing (but not vice-versa without a performance penalty), we tend to go with the flow of the OCaml language and not have runtime type representations. If this was a language that had already paid the performance penalty of dynamic types, it might be a different situation 🙂

from bloomf.

bbenne10 avatar bbenne10 commented on August 21, 2024

So I fail to see how this leaves us in any different a scenario than simply serializing and deserializing priv?
Please don't read this as disrespect or frustration - I just don't see an option that lets us solves this issue in a way that aligns with @pascutto's concerns.

from bloomf.

pascutto avatar pascutto commented on August 21, 2024

So I fail to see how this leaves us in any different a scenario than simply serializing and deserializing priv?

That would be the way to go, my point was that this should be handled by internal functions (e.g. {to,from}_bytes), and not by exposing the type and doing this in the client code, in order to preserve the abstraction.
Of course, these would be responsible for serializing internal parameters along with the bit array.
What remains for the developer is:

  • Make sure that the serialised/deserialised data is used with the same parametrized type
  • Make sure that the same hash function is used before and after serialisation/deserialisation. In particular, that's where the "state of the hash function" fits, as there is no such thing in bloomf at the moment, you have to provide your seeded hash function to the functor instead.

Please don't read this as disrespect or frustration - I just don't see an option that lets us solves this issue in a way that aligns with @pascutto's concerns.

No problem :)
Maybe we misunderstood what your constraints are. Could you please precise what your use case is, and maybe tell us why {to,from}_bytes functions would not work?

from bloomf.

bbenne10 avatar bbenne10 commented on August 21, 2024

Maybe we misunderstood what your constraints are. Could you please precise what your use case is, and maybe tell us why {to,from}_bytes functions would not work?

I did not intend to imply that they wouldn't. Far from it, in fact, as I believe this would be a good answer! My initial solution is, admittedly, aimed at only my immediate use case (using ppx_deriving_yojson). {to,from}_bytes functions would also be perfectly acceptable and would subsume my need for deriving serializers for Bloomf at all.

I remain unconvinced that I yet understand enough to do this from scratch, but I have attempted to make that clear :P What other state need be serialized? I don't see anything other than m, p_len, k and b that need be written out. The rest of the module seems perfectly well contained and stateless. Am I wrong here? Is there something I'm missing? If not and seeing that all of the fields in question are floats or integers (or serializable to an int, in the case of the actual Bitv), would there be a problem of writing out each variable in question (in some arbitrary fixed order - say alphabetic on variable name) separated by some character (I'd propose ASCII code 31 - Unit Separator)?

from bloomf.

pascutto avatar pascutto commented on August 21, 2024

What other state need be serialized? I don't see anything other than m, p_len, k and b that need be written out. The rest of the module seems perfectly well contained and stateless.

That should indeed be enough.

or serializable to an int, in the case of the actual Bitv

An int is probably not enough, but Bitv provides serialisation functions that you can use directly (see here).

would there be a problem of writing out each variable in question (in some arbitrary fixed order - say alphabetic on variable name) separated by some character (I'd propose ASCII code 31 - Unit Separator)?

That sounds ok to me! Maybe check that the Bitv serialisation does not use that specific character though, or put it at the end.

Let me know if you have any issues implementing this so I can help, or do it if I get the time to :)

from bloomf.

bbenne10 avatar bbenne10 commented on August 21, 2024

I've got what appears to be a working (though maybe not idiomatic) implementation. I'll open a PR for discussion today. I'll have to familiarize myself with alcotest to get some tests working before I feel comfortable with it getting merged though.

from bloomf.

pascutto avatar pascutto commented on August 21, 2024

Added in d581f77, thanks!

from bloomf.

Related Issues (2)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.