
Comments (11)

Stebalien avatar Stebalien commented on June 3, 2024 2

The discussions usually aren't behind closed doors, they're just all over the place (https://github.com/ipfs/notes/issues, https://github.com/ipld/specs/issues). That's why I'm trying to fill you in on some of the details. Otherwise, we can't really have an informed discussion.

that seems to have already been rejected long ago for subjective / aesthetic reasons as well as possibly what I perceive as unfounded beliefs

I've expressed my doubts about it being viable in UnixFS due to size restrictions but, as you pointed out, we can probably come up with a solution that just works for power-of-two chunk sizes. However, we wouldn't then want to restrict users and tell them that they can't use fancy chunking algorithm X.

For example, users may want to chunk videos at key frames. Or users may want to chunk a tar file on file boundaries. Or, worse, they may want to do both! What if I have a tar file with a bunch of videos, images, etc; each chunked using a different algorithm depending on the file type? Describing all that information in a few bytes is a tall order.

Basically:

  1. I think this is an interesting and potentially useful idea.
  2. I do not want to extend the CID spec to turn CIDs into something they're not (they're just pointers). However, we won't need to given inlining.
  3. A cursory examination of the problem tells me we probably won't come up with a solution that balances usefulness against complexity/performance/UX. However, I may be wrong.

From experience, the more I try to argue for my points

I really don't mean to discourage you, I'm just trying to give you some context so you don't waste time going down too many dead-end paths.

from cid.

Stebalien avatar Stebalien commented on June 3, 2024 1

99.999% of CIDs would be UnixFS anyway

Not even close. This only appears to be the case now because the IPLD documentation/spec/implementations aren't really up to snuff. However, IPLD is a general-purpose merkle-linked data model.

I don't think that adding 5-11 bytes in addition to the existing 36 is that meaningful.

It's not 5-11 bytes. That's the minimum that only supports existing features. We'd likely need to double the size (and I can think of quite a few nice things we can do with that extra space if we decided doubling the size was OK).

I think the main value is data preservation: it would allow one to "revive" dead IPFS links even if the original dag document is not available anymore.

By 1.0, we hope that most users will just be able to stick with the default options. In that world, ipfs add should be idempotent (generally).

It also makes sense to have a hash that just works by itself without being dependent on the availability of an online resource (seems basic to me), and seems strongly in the IPFS spirit.

So, in the ideal world, users wouldn't have to take data out of IPFS.

I was intending this for a different design (say CIDv2? or who knows CIDv31?) using a flexible binary format (e.g. protocol buffers) that allows for flexible custom metadata, so I definitely did not intend that UnixFS would be "bolted" in the standard somehow!

Got it. From a design standpoint, my primary objection was putting UnixFS stuff in IPLD. Custom (non-unixfs specific) data would make a lot more sense.

However, as I noted, we shouldn't need to extend the CID spec to do this. You can use the identity hash function to compute the CID <cidv1>-<codec>-<id-hash-codec><length><your inline data>. That allows one to embed arbitrary data into a CID. This object can include the file metadata and a CID pointing to the file.
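A minimal sketch of the inlining layout described above, assuming a CIDv1 with the raw codec (0x55) and the identity multihash code (0x00). The helper names and the payload string are hypothetical and only illustrate the byte layout; this is not the go-cid API.

```python
import base64


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


CID_VERSION_1 = 0x01
CODEC_RAW = 0x55      # multicodec code for raw bytes
HASH_IDENTITY = 0x00  # multihash code for the identity "hash"


def inline_cid(data: bytes, codec: int = CODEC_RAW) -> bytes:
    """Build <version><codec><identity-code><length><data> as raw bytes."""
    multihash = encode_varint(HASH_IDENTITY) + encode_varint(len(data)) + data
    return encode_varint(CID_VERSION_1) + encode_varint(codec) + multihash


def to_base32(cid: bytes) -> str:
    """Multibase base32 (lowercase, unpadded) with the 'b' prefix."""
    return "b" + base64.b32encode(cid).decode("ascii").lower().rstrip("=")


# Hypothetical payload: any small blob of metadata could go here.
payload = b"chunker=size-262144"
cid = inline_cid(payload)
print(len(cid), to_base32(cid))
```

The resulting CID grows byte for byte with the payload, which is the size versus flexibility trade-off raised in this comment.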

Again, the issue is size versus flexibility. We'd need to come up with a compact, extensible representation for this information (and make sure we don't add too much complexity to UnixFS along the way).

I'm a little bit surprised that CIDv1 was designed to be static. And since it hasn't really come into use yet maybe you'll consider skipping it and going with something more dynamic, or at least consider this for the future revision, I guess.

CIDv1 is the V1, that's why it has the version. However, we still have to think very carefully about what we want to put in CIDs and why. Specifically, we have to consider, conceptually, what a CID is and the role it plays in IPLD:

In IPLD, a CID is effectively a pointer. It (a) uniquely identifies some piece of data (the hash) and (b) tells us about the data's memory layout (the codec). Embedding arbitrary metadata into a pointer doesn't really make any sense.

I like the concept a lot, anyway. Being an outsider I guess I can't influence much. I cannot reopen the issue and I'm not sure if it will help much, even if I could.

(hm, I thought an OP could reopen their own issues)

I closed the issue as I believed it was resolved (we have too many open issues). I can of course reopen it if you don't feel that way. However, IMO, this discussion belongs on the unixfs 2.0 repo. If you'd like to discuss arbitrary metadata in CIDs, we should probably do that in a new issue (changing directions in the middle of an issue tends to be confusing).


Stebalien avatar Stebalien commented on June 3, 2024

In theory, I agree. However, I'm not sure how we can add all that information to a CID (could be large). It also doesn't really "fit" as CIDs are a part of IPLD (the underlying datastructure system) while files are a part of IPFS/UnixFS. Basically, the layer that understands CIDs doesn't have a concept of "files" and/or the mapping between files and CIDs.

What we can do is record things like the chunking algorithm used inside unixfs files. That would make them reproducible. However, for this to work, ipfs add --force-cid zb2rhe... myimage.jpeg would have to download the root block to be able to pull this information out of the file's metadata.
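To illustrate why the chunking parameters have to be known before a file's root hash can be reproduced, here is a toy sketch. It assumes fixed-size chunking and a flat "hash of the concatenated leaf hashes" root; the real UnixFS DAG layout is different, and the function names are made up for the example.

```python
import hashlib
from typing import List


def chunk_fixed(data: bytes, size: int) -> List[bytes]:
    """Split data into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def toy_root(data: bytes, chunk_size: int) -> str:
    """Hash every chunk, then hash the concatenation of the leaf digests."""
    leaves = [hashlib.sha256(c).digest() for c in chunk_fixed(data, chunk_size)]
    return hashlib.sha256(b"".join(leaves)).hexdigest()


data = bytes(range(256)) * 64  # 16 KiB of sample data

root_a = toy_root(data, 1024)  # 16 leaves
root_b = toy_root(data, 4096)  # 4 leaves

# Same bytes, different chunk size, different root.
print(root_a == root_b)
```

Only someone who knows the chunk size (or finds it recorded in the file's metadata) can recompute the same root from the original bytes.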

We could avoid this issue in some cases by having the gateway return this information. That is, when we request a file from a gateway, the gateway can return a header including the file's metadata. We should probably do that anyways. This way, one could write a browser addon that validates files from gateways.


Stebalien avatar Stebalien commented on June 3, 2024

I'm going to close this in favor of ipld/legacy-unixfs-v2#15 as I don't really see how we could build this feature into CIDs themselves. However, feel free to reopen if you want to continue the discussion.


rotemdan avatar rotemdan commented on June 3, 2024

So for the most part it will need to include the following:

type: unixfs (a couple of bits)
file/directory (a couple of bits)
--trickle (1 bit)
--size-[bytes] or --rabin-[min]-[avg]-[max] (4-10 bytes)

I'm not sure why that is such a large amount of information when encoded in binary (is there anything I'm missing?)
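As a rough check of the 4-10 byte estimate above, the options could be packed into one flag byte followed by varint-encoded sizes. This is only a sketch: the flag layout, the function names, and the sample Rabin values are assumptions made for illustration, not an existing format.

```python
from typing import Optional, Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


# Hypothetical flag bits in the leading byte.
FLAG_DIRECTORY = 0b001
FLAG_TRICKLE = 0b010
FLAG_RABIN = 0b100


def encode_options(is_directory: bool, trickle: bool,
                   size: Optional[int] = None,
                   rabin: Optional[Tuple[int, int, int]] = None) -> bytes:
    """Pack the import options; exactly one of `size` or `rabin` is required."""
    if (size is None) == (rabin is None):
        raise ValueError("give exactly one of size or rabin")
    header = 0
    if is_directory:
        header |= FLAG_DIRECTORY
    if trickle:
        header |= FLAG_TRICKLE
    if rabin is not None:
        header |= FLAG_RABIN
    values = rabin if rabin is not None else (size,)
    return bytes([header]) + b"".join(encode_varint(v) for v in values)


size_opts = encode_options(False, False, size=262144)
rabin_opts = encode_options(False, True, rabin=(87381, 262144, 393216))
print(len(size_opts), len(rabin_opts))
```

With these sample values the fixed-size case takes 4 bytes and the Rabin case 10, but the layout leaves no spare room for options that do not exist yet.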


Stebalien avatar Stebalien commented on June 3, 2024
  1. The primary issue is that IPLD != unixfs. That is, IPLD (and CIDs) don't even have a concept of "this data was imported from some file". Doing this would be like taking some HTTP concept and embedding it into TCP.
  2. As for space, we could pack every existing option in there pretty efficiently but that won't give us any flexibility. We will add new options/algorithms over time and need to leave room to upgrade.

Now, we do have a concept of "inlining". That is, if a CID is pointing to a small IPLD node, we can use the identity hash function to encode the IPLD node directly in the CID. This would work around the first issue.

However, I'm still not convinced that's workable due to the second issue. Even the current options will add, as you noted, ~10 bytes (max). CIDs are currently ~36 bytes so that would increase their size by 1/3.
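The ~36 byte figure can be reproduced with a short sketch, assuming a binary CIDv1 with the dag-pb codec (0x70) and a sha2-256 multihash (0x12). The function name is made up for the example; all four prefix values fit in single-byte varints.

```python
import hashlib


def sha256_cidv1(block: bytes, codec: int = 0x70) -> bytes:
    """Binary CIDv1: <version=1><codec><sha2-256 code><digest length><digest>."""
    digest = hashlib.sha256(block).digest()
    return bytes([0x01, codec, 0x12, len(digest)]) + digest


cid = sha256_cidv1(b"hello world")
extra = 10  # the ~10 bytes of options discussed above

print(len(cid))                         # 4 prefix bytes + 32 digest bytes
print(round(100 * extra / len(cid)))    # relative growth in percent
```

Ten extra bytes on top of 36 is a growth of about 28%, in the neighbourhood of the one third mentioned above.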


rotemdan avatar rotemdan commented on June 3, 2024

I think this is partly subjective and opinion based. I understand the desire for extreme generality but I don't see why not to allow the option for more fine-grained metadata in the link since 99.999% of CIDs would be UnixFS anyway (at least at first). I don't think that adding 5-11 bytes in addition to the existing 36 is that meaningful. I think the main value is data preservation: it would allow one to "revive" dead IPFS links even if the original dag document is not available anymore. It also makes sense to have a hash that just works by itself without being dependent on the availability of an online resource (seems basic to me), and seems strongly in the IPFS spirit.

I was intending this for a different design (say CIDv2? or who knows CIDv31?) using a flexible binary format (e.g. protocol buffers) that allows for flexible custom metadata, so I definitely did not intend that UnixFS would be "bolted" in the standard somehow!

I'm a little bit surprised that CIDv1 was designed to be static. And since it hasn't really come into use yet maybe you'll consider skipping it and going with something more dynamic, or at least consider this for the future revision, I guess.

I like the concept a lot, anyway. Being an outsider I guess I can't influence much. I cannot reopen the issue and I'm not sure if it will help much, even if I could.


rotemdan avatar rotemdan commented on June 3, 2024

OK, here's a simple optimization:

If chunk sizes were limited to powers of two, and assuming the minimum was say, 1024 bytes (I believe max is fixed at 256k but I'm not sure?), then it would only take 3 bits to describe the max chunk size, 9 bits for (min, average, max) for Rabin chunking. That's really a tiny amount of space!
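A sketch of that bit budget, assuming each size is stored as a 3-bit offset from 2^10 (1024 bytes). Note that three bits give eight values, so this layout covers 1 KiB through 128 KiB; including 256 KiB as well would need either a higher minimum or a fourth bit per field. All names here are hypothetical.

```python
from typing import Tuple

BASE_EXPONENT = 10  # smallest size is 2**10 = 1024 bytes
FIELD_BITS = 3      # eight possible exponents per field


def encode_exponent(size: int) -> int:
    """Map a power-of-two size to its 3-bit offset from the base exponent."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    offset = size.bit_length() - 1 - BASE_EXPONENT
    if not 0 <= offset < (1 << FIELD_BITS):
        raise ValueError("size outside the representable range")
    return offset


def pack_rabin(min_size: int, avg_size: int, max_size: int) -> int:
    """Pack (min, avg, max) into a single 9-bit integer."""
    packed = 0
    for size in (min_size, avg_size, max_size):
        packed = (packed << FIELD_BITS) | encode_exponent(size)
    return packed


def unpack_rabin(packed: int) -> Tuple[int, int, int]:
    """Recover (min, avg, max) from the 9-bit integer."""
    mask = (1 << FIELD_BITS) - 1
    offsets = [(packed >> shift) & mask for shift in (6, 3, 0)]
    return tuple(1 << (BASE_EXPONENT + o) for o in offsets)


packed = pack_rabin(16384, 65536, 131072)
print(packed.bit_length() <= 9, unpack_rabin(packed))
```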

Now it might seem like a possibly backward-compatibility issue but actually it isn't really: simply define that for CIDv2 only powers of two are supported as chunk sizes, that's all there is to it, previous versions are not impacted.

A similar approach can be used for many other types of metadata.

I mentioned I'm surprised that something like protocol buffers (or a simplified version of it) wasn't chosen as wire format for the CID, because you use it so much in your projects (I'm actually not that deeply familiar with it, but I'm learning).

Psychologically, it may seem like adding a single-byte field identifier to every piece of information would make the whole thing too cumbersome, complex, confusing or atypical (though the benefit in flexibility is enormous), but remember: nobody cares about this seemingly random string of characters (most normal people don't care about regular URLs).

Here's an amazon URL:

https://www.amazon.com/All-new-Echo-Dot-3rd-Gen/dp/B0792R1RSN/ref=redir_mobile_desktop?_encoding=UTF8&ref_=ods_gw_ha_dt_dc_092318
  • It's 130 characters long. Does anyone care? (this is actually a short example)
  • It's variable length
  • It's way less compact and elegant than CIDs
  • It looks like a big mess

I don't think that anyone cares about content ids really, certainly not their aesthetics (aside from hard-core developers maybe). They look like a big pile of garbage anyway. So why not put some actual useful information in them?

If they are supposed to replace URLs/headers why not make them variable length?

BTW, if they were based on something like protocol buffers then the browser could just parse them and show them more elegantly as an object (without necessarily understanding the semantics of all the fields).

I understand you want CIDs to be used in subdomains (base32). It might be a temporary thing until browsers support CIDs natively (and most likely then hide/simplify them for the users). And I guess it makes sense maybe for IPNS mostly. Then simply don't add the optional fields for those CIDs, I guess?


Stebalien avatar Stebalien commented on June 3, 2024

Now it might seem like a possibly backward-compatibility issue but actually it isn't really: simply define that for CIDv2 only powers of two are supported as chunk sizes, that's all there is to it, previous versions are not impacted.

Again, this is mixing layers. Neither IPLD nor CIDs understand the concept of "chunking" unixfs files.

I mentioned I'm surprised that something like protocol buffers (or a simplified version of it) wasn't chosen as wire format for the CID, because you use it so much in your projects (I'm actually not that deeply familiar with it, but I'm learning).

We chose the most compact representation possible. A CID is a pointer and really, every byte counts (I've worked with projects using IPLD internally that constantly complain about CID size).

Personally, I've railed against the CID size limitation many times. I'd like to have "fat" CIDs that consist of a pointer to some type information and a pointer to the actual data (which would probably solve your use-case as well). However, it turns out that CIDs really do need to be tiny, both from a user's perspective (users will sometimes need to manually type them, such is reality) and from a systems perspective (think of a large graph with many small nodes, where CIDs can easily dominate the size).

Being an outsider I guess I can't influence much.

(realize I need to respond to this)

This is how I (and everyone else on this project) started. However, those of us who have been working on this for a while tend to have quite a bit of context about why certain decisions have been made. I'm not telling you "no, we're not doing this", I'm pointing out issues in the specific solution you've presented.


rotemdan avatar rotemdan commented on June 3, 2024

I don't think I've proposed (or tried to propose) anything concrete; I've mostly been writing openly what's on my mind, as I don't really have the time and energy to reconstruct the reasoning behind possibly many hours of closed meetings of a private company. I've basically pointed towards a general direction (using a flexible format) that seems to have already been rejected long ago for subjective / aesthetic reasons as well as possibly what I perceive as unfounded beliefs (e.g. that real users actually attempt to type 50-60 random characters by hand). And that's fine (really!), as that's the nature of design. From experience, the more I try to argue for my points (especially as an outsider), the less my message will come across (also, I simply don't have the material to build good arguments). So I might as well just stop here :).


rotemdan avatar rotemdan commented on June 3, 2024

I never assumed this suggestion would be applied to a static CID system like CIDv1. I guess I considered it in relation to something very different I had in mind (but didn't explain in detail). In #23 I described a flexible content descriptor (not only identifier) system which can accommodate this as purely optional, ignorable metadata. I'll close this issue for now.

