interface-ipld-format's Introduction

IPLD

Welcome!

This repository is the entrypoint to the IPLD project: documentation, specifications, the website, and a great deal of the design work all live here.

**The website published from this repo is online at https://ipld.io/** -- you'll probably wish to read it there, rather than see the source form here, unless you're aiming to contribute patches.

IPLD stands for "InterPlanetary Linked Data", and is a series of standards and formats for describing data in a content-addressing-emphatic way. The people who work on IPLD do so because we want a world where it's easy to build decentralized, distributed, and inter-operable applications, and we believe robust data formats and a clear story for content-addressing them are a key piece of leverage towards that goal.

Finding Us

  • For chats with the developers and the community: Join us in any of these (bridged) locations:
  • On Github:
    • Check out all our repos in the https://github.com/ipld/ organization.
    • Github issues can be used for discussing designs, documenting user needs, and submitting bug reports.
    • Git patches and Github pull requests are welcome! (Although discussing changes via issues or one of the chat venues above first is highly recommended.)

The IPLD project has a Code of Conduct (which is shared with the IPFS project). Collaborators, contributors, and any participants in community spaces are expected to be able to abide by this code.

Docs Development

With Node.js>=16 installed:

  • Setup: npm install
  • Build: npm run build
  • Serve locally: npm run start
  • Test link integrity: npm test
  • Cleanup: npm run clean
  • Review: open a pull request and tag @ipld/reviewers
  • Publish: Merge to master and Fleek will do the rest

License

SPDX-License-Identifier: Apache-2.0 OR MIT

interface-ipld-format's People

Contributors

achingbrain, daviddias, mikeal, richardlitt, vmx

interface-ipld-format's Issues

How to return a partially resolved IPLD link from a custom resolver?

Hello!
I'm playing with writing my own IPLD format and want to be able to link back to the greater IPLD world after partially resolving a path.

  • What should I pass to the callback in resolver.resolve to do this? A CID instance? A base58 CID string? The API says "IPLD Link" but I'm not sure what that means.
  • If remainderPath is empty, how does IPLD decide if the value is a link (and will return the result of resolve(<link>, "/")) or if the value should be returned directly?

The Next API

After @vmx and I had a talk about what the next iteration of the format API should look like it occurred to me that it’s about time for us to start writing all of this out and discussing it.

Here’s an outline of my ideas for the next API iteration.

serialize(native Object)

Returns a Promise that resolves to an IPLD Block instance.

toBlock(binary Blob)

Returns a Promise that resolves to an IPLD Block instance.

reader(IPLD Block Instance)

Returns an implementation of the Reader interface 👇🏼

Reader.get(path)

Returns a Promise that resolves to an object in the following format.

{ value: 'some value' }
// or
{ link: CID Instance,
  remaining: 'some/path'
}
// or
{ link: CID Instance }

Reader.links()

Returns an async iterator of all the links in the block as instances of CID.

Reader.tree()

Returns an async iterator of all the paths available in the block.
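
A minimal sketch of how the Reader outlined above might behave, assuming the block has already been decoded into a plain JavaScript object. FakeCID and makeReader are hypothetical names used only for illustration; they are not part of any published IPLD API.

```javascript
// Hypothetical stand-in for a real CID class; it only wraps a string.
class FakeCID {
  constructor (id) { this.id = id }
}

// Build a Reader over an already-decoded node (an assumption: the proposal
// itself takes an IPLD Block instance and would decode it internally).
function makeReader (node) {
  return {
    // Walk the path segment by segment and stop early when a link is hit.
    async get (path) {
      const parts = path.split('/').filter(Boolean)
      let current = node
      for (let i = 0; i < parts.length; i++) {
        if (current === null || typeof current !== 'object' || !(parts[i] in current)) {
          throw new Error(`Path not found: ${path}`)
        }
        current = current[parts[i]]
        if (current instanceof FakeCID) {
          const remaining = parts.slice(i + 1).join('/')
          return remaining ? { link: current, remaining } : { link: current }
        }
      }
      return { value: current }
    },

    // Yield every link found anywhere in the node.
    async * links () {
      const stack = [node]
      while (stack.length > 0) {
        const item = stack.pop()
        if (item instanceof FakeCID) yield item
        else if (item !== null && typeof item === 'object') stack.push(...Object.values(item))
      }
    },

    // Yield every path inside the node, without following links.
    async * tree () {
      const walk = function * (obj, prefix) {
        for (const [key, child] of Object.entries(obj)) {
          const path = prefix ? `${prefix}/${key}` : key
          yield path
          if (child !== null && typeof child === 'object' && !(child instanceof FakeCID)) {
            yield * walk(child, path)
          }
        }
      }
      yield * walk(node, '')
    }
  }
}
```

With a node such as { name: 'x', child: new FakeCID('abc') }, get('name') resolves to a value result, while get('child/foo/bar') resolves to a link result whose remaining path is 'foo/bar'.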

Numeric types

There seems to be a problem with handling of numeric types that I haven't seen mentioned yet:

JavaScript and JSON both default to float64 for number representation. This is a problem when handling CBOR or any other 'non-strict' format which can distinguish ints from floats. Currently, when an int encoded in CBOR is converted to JSON and back, it comes back as CBOR with a float.

Possible solutions are:

  • Constrain canonical ipld-cbor to one numeric type
  • Add type metadata to json/js serializations
  • asm.js seems to be trying to do something about this, I haven't read much into it, but it may give some hints: http://asmjs.org/spec/latest/#value-types
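
The round-trip problem and the "type metadata" option can be illustrated with plain JavaScript. This is only a sketch: the tagged shape and the tagNumber / untagNumber helpers are assumptions, not an agreed format.

```javascript
// JavaScript has a single number type, so 1 and 1.0 are the same value and
// the int/float distinction a CBOR decoder saw is lost after a JSON round trip.
const roundTripped = JSON.parse(JSON.stringify({ count: 1.0 }))
// roundTripped.count is the number 1; nothing records whether it was a float.

// One possible workaround (an assumption): carry the numeric type alongside
// the value when exporting to JSON.
function tagNumber (value, type) {
  if (type !== 'int' && type !== 'float') {
    throw new TypeError(`Unknown numeric type: ${type}`)
  }
  if (type === 'int' && !Number.isInteger(value)) {
    throw new RangeError(`${value} is not an integer`)
  }
  return { type, value: String(value) }
}

// Restore the number together with the type a binary encoder should use.
function untagNumber (tagged) {
  return { type: tagged.type, value: Number(tagged.value) }
}
```

The cost of this approach is that every consumer of the JSON has to know about the tagging convention, which is part of why constraining the canonical format to one numeric type is listed as an alternative.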

utils.cid should use the serialized dag node

Can utils.cid take in the serialized dag node instead of the node itself?

Proposal

util.cid(serializedNode, callback)

Gets the CID of the serialized dagNode.

callback must have the signature function (err, cid), where err is an Error if the function fails and cid is a CID instance of the dagNode.

Rationale

Currently resolver.put needs to serialize the data twice (here and here). This causes performance overhead and can lead to rather confusing bugs.
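
A rough sketch of the proposed shape, under loud assumptions: JSON stands in for a real IPLD codec, a toy FNV-1a hash stands in for a real multihash, and the returned hex string stands in for a CID instance. The put helper is hypothetical and only shows that the bytes are produced once.

```javascript
let serializeCalls = 0

// Toy 32-bit FNV-1a hash; NOT cryptographic, only a placeholder for a CID.
function toyHash (bytes) {
  let hash = 0x811c9dc5
  for (const byte of bytes) {
    hash ^= byte
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash.toString(16).padStart(8, '0')
}

// Hypothetical format utilities following the callback style of the spec.
const util = {
  serialize (node, callback) {
    serializeCalls++
    callback(null, new TextEncoder().encode(JSON.stringify(node)))
  },
  // Proposed signature: hash the bytes that were already produced.
  cid (serializedNode, callback) {
    let id
    try {
      id = toyHash(serializedNode)
    } catch (err) {
      return callback(err)
    }
    callback(null, id)
  }
}

// A put() that serializes once and reuses the bytes for the CID.
function put (node, callback) {
  util.serialize(node, (err, bytes) => {
    if (err) return callback(err)
    util.cid(bytes, (err, cid) => {
      if (err) return callback(err)
      callback(null, { cid, bytes })
    })
  })
}
```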

Define what `toJSON()` should look like

A toJSON() method is useful to have an easy way to interoperate with other tools. You can easily export data out of IPLD as most tools/languages understand JSON.

Though there should be guidelines on how data not natively supported by JSON should be represented. For example binary data, but also CIDs.

For CIDs it would, for example, be possible to split them into their separate parts. This way, consumers of the JSON wouldn't need to be able to parse a CID, but could still get information about, for example, the codec.
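
As an illustration of the "split into parts" idea, here is a sketch with a hand-written, hypothetical decoded CID. The field names in the JSON output are assumptions, not a proposed standard, and hex is just one possible encoding for the binary digest.

```javascript
// Hypothetical decoded CID; a real one would come from a CID library.
const exampleCid = {
  version: 1,
  codec: 'dag-cbor',
  multihash: { name: 'sha2-256', digest: Uint8Array.from([0xde, 0xad, 0xbe, 0xef]) }
}

// Render bytes as lowercase hex so they survive JSON.
function bytesToHex (bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('')
}

// One possible JSON shape (an assumption): expose each CID part separately
// so consumers can read the codec or hash name without parsing a CID string.
function cidToJSON (cid) {
  return {
    version: cid.version,
    codec: cid.codec,
    hash: { name: cid.multihash.name, digest: bytesToHex(cid.multihash.digest) }
  }
}
```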

Removing `isLink()` from spec

This spec is about APIs a format needs to implement in order to be used by the resolver.

I think that isLink() is a higher level concept. It doesn't need to be implemented by every format, but can be a function of js-ipld-resolver.

@diasdavid What do you think?

Converge both implementations together

  • s/interface-ipld-format/ipld-format (is this still an issue?)
  • s/go-ipld-node/go-ipld-format
  • create go-ipld-node and go-ipld-resolver underneath go-ipld-format
  • update the description in the README to point to the go interfaces as well
  • start describing that "X is an IPLD Format, as it implements the ipld-node and ipld-resolver interfaces (oop way), or makes them available to be used over itself (functional way)"

Should `tree()` return an array of tuples

In #15 the question came up whether tree() should return an array of objects or just a single object if the values option is given.

I think the spec should be changed so that it returns:

  • If options.values === false, an array of paths
  • If options.values === true, a single object where each of the paths is a key

@diasdavid Do you know if there's a reason why it isn't already like that?
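
The two return shapes suggested above could look roughly like this. The sketch is synchronous and works on an already-decoded object for brevity, whereas the spec's tree() takes a binary blob and a callback; the walk helper is hypothetical.

```javascript
// Collect [path, value] pairs for every entry in a plain decoded node.
function walk (node, prefix = '') {
  const entries = []
  for (const [key, value] of Object.entries(node)) {
    const path = `${prefix}/${key}`
    entries.push([path, value])
    if (value !== null && typeof value === 'object') {
      entries.push(...walk(value, path))
    }
  }
  return entries
}

// Return an array of paths by default, or a single object keyed by path
// when options.values is true.
function tree (node, options = {}) {
  const entries = walk(node)
  if (options.values === true) {
    return Object.fromEntries(entries)
  }
  return entries.map(([path]) => path)
}
```

A single object avoids the array of one-key objects, so a caller can look up a path directly instead of iterating.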

Remove `options` argument from `tree()`

With merging #38 there is no longer an options argument for tree(). Hence that should be removed from all format implementations to avoid confusion.

A replacement for the previously existing tree options is passing in the root / into resolve(), which is tracked by #34.

I suggest waiting for #34 to be finished, before working on this issue.

Instead of opening an issue on every repository, just let people know on this issue that you're working on it and then link to the PR.

Proposal: Changes to IPLD interface to support chunked nodes.

Here's the problem I've got: I need nodes that are larger than 2MB.

This is not a "sharding" problem as it has been typically categorized. I still have a max size of 10MB, I just need to support nodes that are larger than the bitswap max block size. A 5MB cbor binary is a relatively small object that can fit into memory in programs.

In order to do this today you basically have to implement sharding semantics at every layer of a graph that ever might have just a few thousand links in it. That's not very reasonable, I have a data set which has millions of keys at the top level and I'm doing sharding at that layer, but closer to the leaf nodes there's typically less than 100 links, but every once in a while there will be a few thousand.

However, once you go down the rabbit hole of "how do I chunk a single dag node into a couple blocks" you reach some limitations in the current IPLD interface.

  • You need a way for the implementation to request additional blocks in order to deserialize.
  • The current interfaces that return a single block for serialization need to be able to return many. The same goes for the cid interface: it needs to be able to return multiple CIDs for all the blocks necessary to construct the node.

For efficiency, there's a few other things we should be doing as well.

  • Interfaces that return many things (like tree) should return iterables instead of arrays. Arrays are still fine, as they are iterable, but the spec should allow any iterable. If those items are promises they should be resolved.

Once you're doing all this, you realize that you can also offload more of the path resolution to the implementation, since it now has a way to request additional blocks.

Putting all this together, you end up with something very different than the current spec.

I went ahead and documented all of this and wrote an implementation of dag-cbor that will chunk nodes larger than 1MB but still reject nodes larger than 10MB (although that is configurable). Please don't get caught up in the async/await aspects of the proposal and implementation. I'm not trying to have a big callbacks vs promises debate, this was just the most expedient way to write the reference implementation. However, without async iterables some of the interfaces would get pretty hairy to implement with something like streams :(

I'd like to treat this proposal as a discussion rather than a PR. I'm not strongly tied to any particular names or structure. I wrote a full implementation because my ideas needed to be fleshed out a bit and it's much easier to show something working than to just describe a bunch of ideas.

Proposal for alternative interface for IPLD.

ipld(getBlock)

Must return IPLD Interface.

getBlock is an async function that accepts a CID and returns a promise for
the binary blob of that CID.

IPLD.serialize(native object)

Takes a native object that can be serialized.

Returns an iterable. All items in the iterable must be instances of Block or
promises that resolve to instances of Block.

When returning multiple blocks the last block must be the root block.

IPLD.deserialize(buffer)

Takes a binary blob to be deserialized.

Returns a promise to a native object.

IPLD.tree(buffer)

Takes a binary blob of a serialized node.

Returns an iterable. All items in the iterable must be either strings or promises that resolve to strings.

IPLD.resolve(buffer, path)

Takes a binary blob of a serialized node and a path to child links.

Returns a promise to an object with two properties: value and remaining.

value must be either a deserialized node or a CID instance.

remaining must be a string of the remaining path.

Throws an Error() when path cannot be resolved. Error instance should have a
.code attribute set to 404.

IPLD.cids(buffer)

Takes a binary blob of a serialized node.

Returns an iterator. All items in the iterator must be instances of CID or promises that resolve to instances of CID.

Returns only the CIDs required to deserialize this node. Must not contain CIDs of named links.
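
To make the "many blocks, root last" rule concrete, here is a small sketch. It is not the reference implementation mentioned above. Assumptions: JSON stands in for dag-cbor, a toy FNV-1a hash stands in for CIDs, blocks are plain { id, data } objects, the chunk size is tiny so the example actually chunks, and getBlock is passed to deserialize directly instead of through ipld(getBlock).

```javascript
const CHUNK_SIZE = 8 // bytes; deliberately tiny for demonstration

// Toy 32-bit FNV-1a hash used as a placeholder identifier (not a real CID).
function toyId (bytes) {
  let hash = 0x811c9dc5
  for (const byte of bytes) {
    hash ^= byte
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash.toString(16).padStart(8, '0')
}

// Yields the chunk blocks first and the root block last.
function * serialize (value) {
  const bytes = new TextEncoder().encode(JSON.stringify(value))
  const chunkIds = []
  for (let offset = 0; offset < bytes.length; offset += CHUNK_SIZE) {
    const data = bytes.slice(offset, offset + CHUNK_SIZE)
    const id = toyId(data)
    chunkIds.push(id)
    yield { id, data }
  }
  const root = new TextEncoder().encode(JSON.stringify({ chunks: chunkIds }))
  yield { id: toyId(root), data: root }
}

// Reads the root block, requests each chunk through getBlock, and decodes.
async function deserialize (rootBytes, getBlock) {
  const { chunks } = JSON.parse(new TextDecoder().decode(rootBytes))
  const parts = await Promise.all(chunks.map((id) => getBlock(id)))
  const total = parts.reduce((sum, part) => sum + part.length, 0)
  const joined = new Uint8Array(total)
  let offset = 0
  for (const part of parts) {
    joined.set(part, offset)
    offset += part.length
  }
  return JSON.parse(new TextDecoder().decode(joined))
}
```

A caller would store every yielded block, keep the id of the last one as the node's identity, and hand deserialize a getBlock function backed by that store.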

Remove `values` option from `tree()` implementations

With merging #35 there is no longer a values option for tree(). Hence that should be removed from all format implementations to avoid confusion. A proper replacement for that functionality is passing in the root / into resolve(), which is tracked by #34.

I suggest waiting for #34 to be finished, before working on this issue.

Instead of opening an issue on every repository, just let people know on this issue that you're working on it and then link to the PR.

Resolver implementations shouldn't rely on IPFS

Many of the current IPLD Format implementations expect the encoded binary data to be an IPFS block. Though there's no real need for it; they could just rely on the actual data.

This is a tracking issue to change all currently available implementations to follow the spec and rely directly on the binary data.

Implementations that need that fix are:

js-ipld-bitcoin and js-ipld-zcash already follow the spec.

Proposal: Move resolver to use CID instances for links

The current API for resolver.resolve(binaryBlob, path, callback) calls for us to return {'/': baseEncodedCID}.

Now that we're moving to returning CID instances in the serializers it would be better to return instances here as well since the majority of the time, possibly all the time, we end up just parsing this baseEncodedCID back into a CID instance.

Proposal: Remove unnecessary async

The serialize/deserialize/encode/decode methods are async (callback or promise depending on API version) but the APIs that do the actual work as exposed by (for example) protons, pbf and borc are synchronous.

The work @mcollina and Nearform did exposed artificially introduced asynchronous behaviour as a performance bottleneck of the IPFS/IPLD/libp2p codebases and something that should be removed, yet in the API rewrite these methods are still asynchronous.

Are there any IPLD formats that actually require asynchronicity to serialize/deserialize?
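
A small sketch of the deferral this issue describes: the same synchronous work, once called directly and once behind a Promise. JSON plus TextEncoder stands in for a real codec such as borc, and encodeSync / encodeAsync are hypothetical names.

```javascript
// Synchronous codec work: nothing here needs to wait on I/O.
function encodeSync (value) {
  return new TextEncoder().encode(JSON.stringify(value))
}

// The same work behind a Promise, as an async API would expose it.
async function encodeAsync (value) {
  return encodeSync(value)
}

// Record the order in which things happen to make the deferral visible.
const events = []

events.push('sync:start')
encodeSync({ a: 1 })
events.push('sync:done') // the result is available immediately

events.push('async:start')
encodeAsync({ a: 1 }).then(() => events.push('async:done'))
events.push('after-async-call') // recorded before 'async:done'
```

The encoding itself finishes inside the encodeAsync call, but the caller only sees the result on a later microtask, so every call pays for a Promise allocation and a scheduling hop even though nothing was awaited.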

util.cid options

Allow the version and hashAlg to be specified when creating a cid.

util.cid(dagNode, [options,] callback)

options.version, defaults to 1
options.hashAlg, defaults to hashAlg for the resolver
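
A sketch of how the optional options argument and its defaults might be handled. The resolver object and its defaultHashAlg value are placeholders, and the function only reports which settings would be used instead of computing a real CID.

```javascript
// Hypothetical resolver default; real formats would define their own.
const resolver = { defaultHashAlg: 'sha2-256' }

// Accepts cid(dagNode, callback) or cid(dagNode, options, callback).
function cid (dagNode, options, callback) {
  if (typeof options === 'function') {
    callback = options
    options = {}
  }
  const version = options.version === undefined ? 1 : options.version
  const hashAlg = options.hashAlg || resolver.defaultHashAlg
  // A real implementation would serialize and hash dagNode here; this sketch
  // only passes back the settings that were selected.
  callback(null, { version, hashAlg })
}
```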

See the discussion at ipld/js-ipld#82

Instead of opening an issue on every repository, just let people know on this issue that you're working on it and then link to the PR.

Add `defaultHashAlg` property

If data is inserted into IPLD you need to specify the format as well as the hash that is used. For many formats there's a default hash (often there's even only a single hash type possible). If a format has such a default hash, a property on the resolver called defaultHashAlg would be defined.

This way you often won't need to specify the hash type (you might not even know it), but only the format. Examples are:

  • Bitcoin => dbl-sha2-256
  • Git => sha1
  • Zcash => dbl-sha2-256

What do others think of that idea?
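
For illustration, the lookup could work along these lines. The hash names come from the examples above; the formats object and the chooseHashAlg helper are assumptions made for this sketch.

```javascript
// Default hash algorithms per format, taken from the list above.
const formats = {
  bitcoin: { defaultHashAlg: 'dbl-sha2-256' },
  git: { defaultHashAlg: 'sha1' },
  zcash: { defaultHashAlg: 'dbl-sha2-256' }
}

// An explicit hash algorithm wins; otherwise fall back to the format default.
function chooseHashAlg (formatName, hashAlg) {
  const format = formats[formatName]
  if (!format) throw new Error(`Unknown format: ${formatName}`)
  return hashAlg || format.defaultHashAlg
}
```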

API review

@mikeal and I want to review the current API and see how it can be made more ergonomic.

The goal is to have a nice to use and easy to implement API which aligns well with the current vision of IPLD. Breaking the API is possible and even very likely.

The spec and the format implementations are also currently not in sync (see the open issues for more information).

Part of it is that it will use async/await, hence it aligns well with the Awesome Endeavour: Async Iterators.

We can use this issue for further discussion/referencing things.

/cc @hacdias

Remove `isLink()` method from formats

isLink() is no longer part of the IPLD Format spec (#23). Hence remove isLink() from the formats that implement it. Less code, fewer bugs.

In case isLink() is needed in some code, use the one from js-ipld instead once implemented (ipld/js-ipld#126).

Instead of opening an issue on every repository, just let people know on this issue that you're working on it and then link to the PR.

Implement `defaultHashAlg` property

After merging #27 all formats need to implement a defaultHashAlg property (see the spec for more information).

See the discussion at ipld/js-ipld#82

Instead of opening an issue on every repository, just let people know on this issue that you're working on it and then link to the PR.

Support a .toJSON method as part of the spec

@wanderer and @kumavis have shown a lot of interest in having a conversion to JSON as part of the standardised interface.

My only concern is that although we can have a .toJSON, we really can't expect to have a fromJSON, otherwise we would be asking every single format implementor to spend a lifetime figuring out all the quirks of JSON, especially because it does mangle binary and numeric types.

Are we ok with having a .toJSON without a .fromJSON? Is that what people want?

Implementation of nested objects

With #33 it is now specified that if a format implementation receives the root only /, then it should return a nested object based on the paths tree() returns.
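
One way to build such a nested object from flat paths is sketched below. It assumes '/'-separated paths paired with their values; the nest helper is hypothetical and not taken from any format implementation.

```javascript
// Build a nested object from flat [path, value] pairs such as those a
// format could derive from tree().
function nest (pairs) {
  const root = {}
  for (const [path, value] of pairs) {
    const parts = path.split('/').filter(Boolean)
    if (parts.length === 0) continue // skip the root path itself
    let target = root
    // Create intermediate objects for every segment except the last one.
    for (const part of parts.slice(0, -1)) {
      if (target[part] === null || typeof target[part] !== 'object') {
        target[part] = {}
      }
      target = target[part]
    }
    target[parts[parts.length - 1]] = value
  }
  return root
}
```

For example, the pairs ['author/name', 'Ada'] and ['message', 'hi'] produce { author: { name: 'Ada' }, message: 'hi' }.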

Instead of opening an issue on every repository, just let people know on this issue that you're working on it and then link to the PR.

Trying to figure out actual interface described in this repo

So here is my interpretation of the interface described in this repo, encoded in flow types:

// https://github.com/ipld/interface-ipld-format#ipld-format-utils
interface Format <node> {
  serialize(node, Callback<Error, ArrayBuffer>):void;
  deserialize(ArrayBuffer, Callback<Error, node>):void;
  cid(node, Callback<Error, CID>):void;
}

// https://github.com/ipld/interface-ipld-format#local-resolver-methods
interface Resolver<value> {
  resolve(ArrayBuffer, Path, Callback<Error, Result<value>>):void;
  tree(ArrayBuffer, Callback<Error, Entry<value>[]>):void;
  tree(ArrayBuffer, {level:number, values?:boolean}, Callback<Error, Entry<value>[]>):void;
  isLink(ArrayBuffer, Path, Callback<Error, Link>):void;
}

type Path = string
type Link = { [Path]:CID }

interface Result<value> {
  remainderPath:Path;
  value:Link|value;
}

type Entry <value> = {[Path]:value}

// Details of CID aren't actually important in this context
interface CID {
}

// Callbacks is a function that is either passed an error as first argument,
// or passed just a second `value` argument if successful.  
interface Callback <error, value> {
  (error):void;
  (null|void, value):void;
}

Some notes on my interpretation:

  • Format utils API spec refers to dagNode, but it is unclear what its type or shape is, or whether it has a specific one. My understanding is that the dagNode representation is specific to the resolver implementation, which is why Format<node> has a type parameter node representing the dagNode. A specific implementation would probably do something like

    class GitFormat implements Format<GitNode> {
      // ...
    }

    Where GitNode will have a specific type / shape.

  • Resolver API refers to the resolved value, but to me it's unclear what the type / shape of the value is. I assumed here as well that it is specific to the resolver implementation, which is why Resolver<value> uses the generic type parameter value. Here I also assume that a specific implementation would define the type / shape of the value.

Unknowns

  • resolver.resolve(binaryBlob, path, callback) mentions that the result passed to callback has a value that is "The value resolved or an IPLD link". How do I figure out which one it is? Presumably I know what the value type is and therefore can figure out if it's that or a link, but then again I'm worried about the ambiguity here. I would much rather have a result with type { value:value, remainderPath:Path } | { link:Link, remainderPath:Path } so it's clear which one I get back. Also, is a link in this context a tuple with path and CID as described in isLink, {'/': <CID> }?
  • Can a link contain multiple path to CID mappings or is it always one? If it's always one, why not use a data structure that would not require iteration?
  • resolver.tree passes back an array of tuples like [ { '/foo': 'bar' } ...]. Does that mean each item in that array contains an object with a single property whose name represents a path? If so, why not just use a data structure that would not require iteration to access data? Also, my guess was that the value in such a tuple is of the same type as the one returned in resolve, isn't it?
  • I'm almost certain that given Resolver<value> and Format<node> there is some correlation between the value and node type parameters, but I can't figure it out. Maybe that relation is an implementation detail of the actual resolver?

New API proposal

I've been asked to provide feedback on API changes / proposals, and sadly that usually means I go and challenge people's efforts. I think it might be better / only fair to propose an API myself and let others challenge it instead. So here is what IMO would be the best API:

Using flow syntax as it removes ambiguity

I would also like to use this opportunity to challenge the naming conventions, as I have the following issues with them:

  • serialize is awfully long and easy to misspell, encode is shorter and a lot harder to get wrong.
  • deserialize is even longer, decode isn't.
  • format - Is the way in which something is arranged or set out; in other words it corresponds to how and not what. A codec, on the other hand, is the thing that knows how to encode / decode to a specific format. Therefore I would argue that codec is a much better term to use, as it represents an implementation that can encode data into binary of a specific format and decode binary in that format back to the data. It also makes more sense for a codec to provide operations named encode / decode than any other term.

We'll use opaque type alias BinaryEncoded<a> to refer to binary encoded data of type a.

export opaque type BinaryEncoded<data>:Uint8Array = Uint8Array

We'll also use CID<a> interface to refer to CIDs for data of type a.

export interface CID<a> {
  toString(): string;
  toBaseEncodedString(base?: string): string;
  buffer: BinaryEncoded<a>;
  equals(mixed): boolean;
}

With that out of the way I would propose that a valid IPLD Codec be an implementation that complies with an IPLDCodec<a> interface as defined below:

interface IPLDCodec<a> {
  // https://github.com/multiformats/multicodec/blob/master/table.csv
  format:Multihash;
  defaultHashAlgorithm:HashAlgorithm;

  encode(a):BinaryEncoded<a>|Promise<BinaryEncoded<a>>;
  decode(BinaryEncoded<a>):a|Promise<a>;
  
  links <b>(BinaryEncoded<a>):AsyncIterable<CID<b>>;
  paths <b>(BinaryEncoded<a>):AsyncIterable<string>;
  get <b>(BinaryEncoded<a>, path):IPLDEntry<b>|Promise<IPLDEntry<b>>;
}

type IPLDEntry<b> =
  | IPLDInlineEntry<b>
  | IPLDLinkEntry<b>

type IPLDLinkEntry<b> = { type:"link", target:CID<b>, path:string }
type IPLDInlineEntry<b> = { type:"inline", data:b }

Additional notes:

  1. I have removed the cid function; I am convinced that it's a historical artifact. All the implementations I've seen use the provided hashAlgorithm (or the default if not provided) to create a multihash of the encoded data, and then use the passed version & multicodec format to instantiate a CID.

    In other words I do not see how a codec could do anything else. Maybe use a different serialization method? Seems unlikely, but if so, maybe the implementation should be required to provide that instead. Another thing I can imagine is a different way to hash. If so, again, the codec should probably just be required to provide an optional method for that; otherwise the cid method requirement is nothing but a way to introduce bugs.

  2. Anything returning a Promise could also return the value that it would resolve to. The consumer can still await without any impact & is also free to avoid await when relevant. From the implementer's perspective there is no need to wrap anything in a promise.

  3. This incorporates the links API proposed in #52

  4. tree is replaced by paths and is scoped to immediate paths, meaning it's not the codec implementer's job to follow links etc... Generalized tree traversal can be implemented outside and isn't the codec implementer's concern.

  5. get(blob, path) returns an (eventual) "entry" that is either some inline data that was encoded into the blob, or a link within the other IPLD node. Note that the link has a path that corresponds to the path to the data within the node the link is targeting.

  6. A different value for the entry type is used ("link", "inline"), which enables type checkers (flow, typescript) to refine the type within the union and in general makes it obvious how one figures out what kind of entry came back (what if data is null? what if for some reason the returned entry has also been passed a target field?)

  7. @mikeal was suggesting in #52 to embrace the IPLD Block API, which I'm all for; however I don't think it should be made the codec implementer's concern. The API proposed in #52 can trivially be layered on top of the proposed API and be agnostic of the codec implementation.
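
Notes 2, 5 and 6 can be illustrated with a toy codec. This is a sketch under assumptions, not the proposed implementation: JSON stands in for a real format, links are written as { '/': '<cid string>' } (the convention the current resolver API uses), plain strings stand in for CID instances, and jsonCodec, isLink and describe are hypothetical names.

```javascript
// A toy JSON codec shaped like part of the proposed IPLDCodec interface.
const jsonCodec = {
  encode (data) {
    return new TextEncoder().encode(JSON.stringify(data))
  },
  decode (bytes) {
    return JSON.parse(new TextDecoder().decode(bytes))
  },
  // Returns a tagged entry so callers can switch on entry.type.
  get (bytes, path) {
    const parts = path.split('/').filter(Boolean)
    let current = this.decode(bytes)
    for (let i = 0; i < parts.length; i++) {
      if (current === null || typeof current !== 'object' || !(parts[i] in current)) {
        throw new Error(`Cannot resolve ${path}`)
      }
      current = current[parts[i]]
      if (isLink(current)) {
        // The entry's path is what remains to be resolved inside the target.
        return { type: 'link', target: current['/'], path: parts.slice(i + 1).join('/') }
      }
    }
    return { type: 'inline', data: current }
  }
}

// A value counts as a link when it is an object with a string '/' property.
function isLink (value) {
  return value !== null && typeof value === 'object' && typeof value['/'] === 'string'
}

// Consumers can await the result whether or not get() returned a Promise.
async function describe (bytes, path) {
  const entry = await jsonCodec.get(bytes, path)
  switch (entry.type) {
    case 'link': return `link to ${entry.target} (remaining: "${entry.path}")`
    case 'inline': return `inline ${JSON.stringify(entry.data)}`
    default: throw new Error(`Unknown entry type: ${entry.type}`)
  }
}
```

Because the tag carries the distinction, an inline entry whose data is null stays unambiguous, which is the case note 6 worries about.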

Concerns

I have brought up in the other thread ipld/ipld#64 (comment) the fact that support for passing parameters should be a fundamental design concern to ensure that things like IPLD Explorer can be just as useful with parametric IPLD Nodes. That however (at least as proposed) conflicts with links(a), which assumes links can be revealed without any parameters. For that reason I would prefer to omit links and replace paths with something like list (Edit: After thinking a bit more I don't think list as proposed below is even going to help much, I'll get back to this once I have more time to think it through):

--- a/Users/gozala/Desktop/before.js
+++ b/Users/gozala/Desktop/after.js
@@ -6,8 +6,7 @@ interface IPLDCodec<a> {
   encode(a): BinaryEncoded<a> | Promise<BinaryEncoded<a>>;
   decode(BinaryEncoded<a>): a | Promise<a>;
 
-  links<b>(BinaryEncoded<a>): AsyncIterable<CID<b>>;
-  paths<b>(BinaryEncoded<a>): AsyncIterable<string>;
+  list<b>(BinaryEncoded<a>): AsyncIterable<{ isLink: boolean, path: string }>;
   get<b>(BinaryEncoded<a>, path): IPLDEntry<b> | Promise<IPLDEntry<b>>;
 }

Implement new `cid(blob, ...)`

The IPLD Format spec was changed so that util.cid() now takes the binary blob and not the deserialized DAG Node as argument (#24).

This is a breaking change and needs to be done on all existing formats. So please don't merge before PRs from all formats are approved. This way things can be merged and released in one go.

Instead of opening an issue on every repository, just let people know on this issue that you're working on it and then link to the PR.

properties: align spec with implementations

The spec says that multicodec and defaultHashAlg are top-level properties, but all implementations implement them as properties of the resolver.

I'd argue that the implementations are wrong. The properties should really be top-level.

This issue should be tackled during the IPLD API review.
