Here's the problem I've got: I need nodes that are larger than 2MB.
This is not a "sharding" problem as it has been typically categorized. I still have a max size of 10MB, I just need to support nodes that are larger than the bitswap max block size. A 5MB cbor binary is a relatively small object that can fit into memory in programs.
In order to do this today you basically have to implement sharding semantics at every layer of a graph that might ever have just a few thousand links in it. That's not very reasonable. I have a data set with millions of keys at the top level, and I'm doing sharding at that layer. Closer to the leaf nodes there are typically fewer than 100 links, but every once in a while there will be a few thousand.
However, once you go down the rabbit hole of "how do I chunk a single dag node into a couple blocks" you reach some limitations in the current IPLD interface.
- You need a way for the implementation to request additional blocks in order to deserialize.
- The current interfaces that return a single block for serialization need to be able to return many. The same goes for the `cid` interface: it needs to be able to return multiple CIDs for all the blocks necessary to construct the node.
For efficiency, there are a few other things we should be doing as well.
- Interfaces that return many things (like tree) should return iterables instead of arrays. Arrays are still fine, as they are iterable, but the spec should allow any iterable. If those items are promises they should be resolved.
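As a rough sketch of what "any iterable, with promises resolved" means for a consumer: the `collect` helper below is a hypothetical name, not part of the proposal. It relies only on `for await`, which accepts arrays, sync iterables, and async iterables, and awaits each item.

```javascript
// Hypothetical helper (not part of the proposal): gather every item from an
// array, a sync iterable, or an async iterable, resolving promises as it goes.
async function collect (iterable) {
  const items = []
  // `for await` awaits each yielded value, so an item that is a promise
  // is resolved before it is pushed.
  for await (const item of iterable) {
    items.push(item)
  }
  return items
}
```

A caller that wants to stay lazy can use the same `for await` loop directly instead of collecting into an array.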
Once you're doing all this, you realize that you can also offload more of the path resolution to the implementation, since it now has a way to request additional blocks.
Putting all this together, you end up with something very different than the current spec.
I went ahead and documented all of this and wrote an implementation of `dag-cbor` that will chunk nodes larger than 1MB but still reject nodes larger than 10MB (although that is configurable). Please don't get caught up in the `async/await` aspects of the proposal and implementation. I'm not trying to have a big callbacks vs promises debate; this was just the most expedient way to write the reference implementation. However, without async iterables some of the interfaces would get pretty hairy to implement with something like streams :(
I'd like to treat this proposal as a discussion rather than a PR. I'm not strongly tied to any particular names or structure. I wrote a full implementation because my ideas needed to be fleshed out a bit and it's much easier to show something working than just describe a bunch of ideas.
Proposal for alternative interface for IPLD.
ipld(getBlock)
Must return IPLD Interface.
`getBlock` is an async function that accepts a CID and returns a promise for the binary blob of that CID.
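A minimal sketch of a `getBlock` function backed by an in-memory map. The `store` map, keying by the CID's string form, and rejecting on a missing block are all assumptions made for illustration; the proposal only says that `getBlock` takes a CID and returns a promise for the bytes.

```javascript
// Hypothetical in-memory block store, keyed by the CID's string form.
const store = new Map()

// Accepts a CID (anything with a toString()) and returns a promise for
// the binary blob stored under it.
async function getBlock (cid) {
  const key = cid.toString()
  if (!store.has(key)) {
    // Assumption: a missing block rejects; the proposal does not specify this.
    throw new Error(`Block not found: ${key}`)
  }
  return store.get(key)
}
```

In a real setup this function would more likely fetch from a local block store or the network, but the shape stays the same.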
IPLD.serialize(native object)
Takes a native object that can be serialized.
Returns an iterable. All items in the iterable must be instances of `Block` or promises that resolve to instances of `Block`.
When returning multiple blocks the last block must be the root block.
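To illustrate the "root block last" rule, here is a sketch of a chunking `serialize` written as a generator. It is not the dag-cbor implementation: JSON stands in for CBOR, a hex SHA-256 digest stands in for a CID, plain `{ cid, data }` objects stand in for `Block`, and the `{ chunks: [...] }` root layout and `maxSize` parameter are invented for the example.

```javascript
const crypto = require('crypto')

// Stand-in for a real CID: hex SHA-256 digest of the bytes (assumption).
const fakeCid = data => crypto.createHash('sha256').update(data).digest('hex')

// Stand-in for the Block class: a plain object (assumption).
const makeBlock = data => ({ cid: fakeCid(data), data })

// Yields one block for a small node. For a large node, yields the chunk
// blocks first and the root block (which lists the chunk CIDs) last.
function * serialize (obj, maxSize = 1024) {
  const bytes = Buffer.from(JSON.stringify(obj))
  if (bytes.length <= maxSize) {
    yield makeBlock(bytes)
    return
  }
  const chunkCids = []
  for (let i = 0; i < bytes.length; i += maxSize) {
    const block = makeBlock(bytes.subarray(i, i + maxSize))
    chunkCids.push(block.cid)
    yield block
  }
  // Hypothetical root layout: an ordered list of chunk CIDs.
  yield makeBlock(Buffer.from(JSON.stringify({ chunks: chunkCids })))
}
```

Because the root comes last, a consumer can store each block as it arrives and treat the final CID as the address of the whole node. This sketch does not handle a root block that itself exceeds `maxSize`.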
IPLD.deserialize(buffer)
Takes a binary blob to be deserialized.
Returns a promise to a native object.
IPLD.tree(buffer)
Takes a binary blob of a serialized node.
Returns an iterable. All items in the iterable must be either strings or promises that resolve to strings.
IPLD.resolve(buffer, path)
Takes a binary blob of a serialized node and a path to child links.
Returns a promise to an object with two properties: `value` and `remaining`.
`value` must be either a deserialized node or a CID instance.
`remaining` must be a string of the remaining path.
Throws an `Error()` when path cannot be resolved. Error instance should have a `.code` attribute set to `404`.
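A sketch of the `resolve` contract, under assumptions: JSON stands in for the codec, a link is written as `{ "/": "<cid>" }` (the DAG-JSON convention, borrowed here only for illustration), and a CID is a plain string. The helper `isLink` is hypothetical.

```javascript
// Hypothetical link check: treats { "/": "<cid>" } as a link (assumption).
const isLink = v => v !== null && typeof v === 'object' && typeof v['/'] === 'string'

const hasKey = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key)

// Walks `path` inside one serialized node. Stops at the first link and
// reports the unresolved rest of the path in `remaining`.
async function resolve (buffer, path) {
  let value = JSON.parse(buffer.toString())
  const parts = path.split('/').filter(Boolean)
  while (parts.length > 0) {
    const key = parts.shift()
    if (value === null || typeof value !== 'object' || !hasKey(value, key)) {
      const err = new Error(`Cannot resolve path: ${path}`)
      err.code = 404
      throw err
    }
    value = value[key]
    if (isLink(value)) {
      // The caller fetches this CID and continues with `remaining`.
      return { value: value['/'], remaining: parts.join('/') }
    }
  }
  return { value, remaining: '' }
}
```

A caller would loop: resolve against the current block, and while `remaining` is non-empty, fetch the returned CID with `getBlock` and resolve again.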
IPLD.cids(buffer)
Takes a binary blob of a serialized node.
Returns an iterator. All items in the iterator must be instances of CID or promises that resolve to instances of CID.
Returns only the CIDs required to deserialize this node. Must not contain CIDs of named links.
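To show how `cids` and `deserialize` could fit together with `getBlock`, here is a sketch under the same kind of assumptions as a toy codec: JSON instead of CBOR, string CIDs, and a made-up root layout of `{ chunks: [cid, ...] }` for chunked nodes. None of this is the reference implementation's actual format.

```javascript
// Yields only the CIDs needed to rebuild this node (the chunk CIDs),
// never the CIDs of named links. Root layout is hypothetical.
function * cids (buffer) {
  const root = JSON.parse(buffer.toString())
  if (root !== null && Array.isArray(root.chunks)) {
    yield * root.chunks
  }
}

// Factory mirroring ipld(getBlock): deserialize may request more blocks.
function ipld (getBlock) {
  return {
    cids,
    async deserialize (buffer) {
      const parts = []
      for (const cid of cids(buffer)) {
        parts.push(await getBlock(cid))
      }
      // No chunk CIDs: the buffer is the whole node.
      if (parts.length === 0) return JSON.parse(buffer.toString())
      return JSON.parse(Buffer.concat(parts).toString())
    }
  }
}
```

One known gap in this toy layout is that an ordinary node with its own `chunks` array would be mistaken for a chunked root; a real codec would need an unambiguous marker.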