

go-unixfs

go-unixfs implements unix-like filesystem utilities on top of an IPLD merkledag


❗ This repo is no longer maintained.

πŸ‘‰ We highly recommend switching to the maintained version at https://github.com/ipfs/boxo/tree/main/ipld/unixfs.

🏎️ Good news! There is tooling and documentation to expedite a switch in your repo.

⚠️ If you continue using this repo, please note that security fixes will not be provided (unless someone steps in to maintain it).

πŸ“š Learn more, including how to take the maintainership mantle or ask questions, here.


Package Directory

This package contains many subpackages, each of which can be very large on its own.

Top Level

The top-level unixfs package defines the unixfs format data structures and some helper methods around them.
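
As a quick orientation, here is a minimal sketch (method names taken from the current package; worth double-checking against the godoc) of building one of these data structures and round-tripping it through its protobuf encoding:

    package main

    import (
        ft "github.com/ipfs/go-unixfs"
        ftpb "github.com/ipfs/go-unixfs/pb"
    )

    func roundTrip() error {
        // Build a unixfs file descriptor and attach some inline data.
        fsn := ft.NewFSNode(ftpb.Data_File)
        fsn.SetData([]byte("hello"))

        // Serialize to the unixfs protobuf...
        raw, err := fsn.GetBytes()
        if err != nil {
            return err
        }

        // ...and parse it back.
        _, err = ft.FSNodeFromBytes(raw)
        return err
    }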

importers

The importer subpackage is what you'll use when you want to turn a normal file into a unixfs file.
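
For example, a minimal sketch (assuming you already have an ipld.DAGService wired up) of importing a local file:

    package main

    import (
        "os"

        chunker "github.com/ipfs/go-ipfs-chunker"
        ipld "github.com/ipfs/go-ipld-format"
        "github.com/ipfs/go-unixfs/importer"
    )

    func importFile(dserv ipld.DAGService, path string) (ipld.Node, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()

        // DefaultSplitter chunks the stream into fixed-size blocks; the
        // importer assembles them into a unixfs file DAG.
        return importer.BuildDagFromReader(dserv, chunker.DefaultSplitter(f))
    }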

io

The io subpackage provides helpers for reading files and manipulating directories. The DagReader takes a reference to a unixfs file and returns a file handle that supports reading and seeking. The Directory interface lets you read items in a directory, add items to it, and perform lookups.
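
A minimal sketch of reading a file back out (the DagReader behaves like an io.ReadSeeker over the DAG):

    package main

    import (
        "context"
        "io"

        ipld "github.com/ipfs/go-ipld-format"
        uio "github.com/ipfs/go-unixfs/io"
    )

    func readAll(ctx context.Context, fileNode ipld.Node, dserv ipld.NodeGetter) ([]byte, error) {
        dr, err := uio.NewDagReader(ctx, fileNode, dserv)
        if err != nil {
            return nil, err
        }
        // Read the whole file; Seek is also available for random access.
        return io.ReadAll(dr)
    }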

mod

The mod subpackage implements a DagModifier type that can be used to write to an existing unixfs file, or create a new one. The logic for this is significantly more complicated than for the dagreader, so it's a separate type. (TODO: maybe it still belongs in the io subpackage though?)
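
A sketch of an in-place edit (signatures taken from the current package; double-check against the source before relying on them):

    package main

    import (
        "context"

        chunker "github.com/ipfs/go-ipfs-chunker"
        ipld "github.com/ipfs/go-ipld-format"
        "github.com/ipfs/go-unixfs/mod"
    )

    func overwriteAt(ctx context.Context, fileNode ipld.Node, dserv ipld.DAGService, data []byte, offset int64) (ipld.Node, error) {
        dm, err := mod.NewDagModifier(ctx, fileNode, dserv, chunker.DefaultSplitter)
        if err != nil {
            return nil, err
        }
        if _, err := dm.WriteAt(data, offset); err != nil {
            return nil, err
        }
        // GetNode flushes pending writes and returns the updated root.
        return dm.GetNode()
    }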

hamt

The hamt subpackage implements a CHAMP-style HAMT (hash array mapped trie) that is used for unixfs directory sharding.

archive

The archive subpackage implements a tar importer and exporter. The objects created here are not officially unixfs, but in the future, this may be integrated more directly.

test

The test subpackage provides several utilities to make testing unixfs-related code easier.

Install

    go get github.com/ipfs/go-unixfs

License

MIT Β© Juan Batiz-Benet


Issues

`go-merkledag` dependency version discrepancy with `go-ipfs`

The go-merkledag gx-version used here,

go-unixfs/package.json

Lines 34 to 37 in f425628

"author": "why",
"hash": "QmfGzdovkTAhGni3Wfg2fTEtNxhpwWSyAJWW2cC1pWM9TS",
"name": "go-merkledag",
"version": "1.0.1"

differs from the one in go-ipfs, which is causing build problems when using gx-go link to work on this repo through the go-ipfs one.

This also raises the more general question: every time a package dependency shared between the two repos is updated, should it be immediately updated in the other one to avoid breaking the build?

(edit: well, not immediately, but I mean, is there an alternative to doing a local gx update just so gx link can work? given that this extracted repo is meant to be used primarily with go-ipfs, the gx link scenario seems pretty common.)

Remove the `child` interface from the `hamt` package

Background: ipfs/boxo#387.

(This is not completely thought out but a broad description of what we need to do. Further discussion and testing will probably be needed.)

The idea is to eliminate the interface and unify Shard and shardValue, making it evident that values (links to entries in the directory) are stored at the leaf nodes, with only one value per node. The Shard currently contains many values, but only because it encodes its child nodes itself as a performance improvement, avoiding extra node requests at the DAG layer. That performance should be maintained, but the logic needs to be clarified: the shard/shard-value distinction should be encapsulated in loadChild (and related functions).

Concretely (a rough structural sketch follows the list):

  • Remove the child interface and the shardValue.
  • Make the children field a slice of Shard.
  • Extend Shard to store the value formerly kept in shardValue, that is, the key/value pair (which may need to be renamed).
  • Unify every piece of switch logic that handled Shard and shardValue separately.
  • Modify loadChild to create Shards instead of shardValues.
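
A rough, hypothetical sketch of the resulting structure (all field names here are illustrative, not a final API):

    package hamt

    import ipld "github.com/ipfs/go-ipld-format"

    type Shard struct {
        // Replaces the child interface: every slot is simply another Shard.
        children []*Shard

        // For leaf shards, the key/value pair formerly held by shardValue.
        key string
        val *ipld.Link
    }

    // isValueNode reports whether this shard is a leaf holding a single value.
    func (s *Shard) isValueNode() bool {
        return s.key != "" && s.val != nil
    }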

Remove `HAMTShard` UnixFS type

    message Data {
        enum DataType {
            Raw = 0;
            Directory = 1;
            File = 2;
            Metadata = 3;
            Symlink = 4;
            HAMTShard = 5;
        }
    }

This is a breaking change that goes in line with ipfs/go-mfs#86 to hide implementation details as much as possible: instead of having both a Directory (representing the BasicDirectory implementation) and a HAMTShard UnixFS type, we would expose only the Directory type, representing the Directory interface without qualifying which Go type implements it (as this can change on the fly and the consumer should not depend on it).

The code change is trivial; assigning @aschmahmann mainly to analyze feasibility (now or in a future release).

Do not add an internal timeout for "extra" fetches used to compute directory size in the unsharding algorithm

Continuation and expansion of #94 (comment). Discussing in its own issue as I think this might be too long for a review thread.

The final advice there was to remove the timeout and block on the fetch. I'm not challenging the given advice but giving the full context to make sure we're aligned here since this can have a potentially negative UX effect.

Brief background on the unsharding algorithm:

  1. The user modifies the HAMT directory (adds/removes an entry).
  2. The size has potentially changed. We need to evaluate if it has fallen below the threshold to switch to a basic directory. We do this in the same add/remove operation; we do not defer it to a later stage like the extraction of the CID in GetNode.
  3. We don't want to fetch all the entries so we don't compute the total/exact size and instead just try to fetch as little as possible to prove we are still above the threshold.
  4. If some shards are unavailable in the network (high delay, missing/removed, etc.) we might not be able to retrieve enough shards to confirm we are above the threshold, but neither can we claim we are below it.
  5. Since the priority is a deterministic relation between directory size and CID, in case of doubt we fail the operation, even if we were able to perform the original add/remove.
  6. We can never really be sure whether a shard is missing or just slow to arrive, so we use a timeout to decide when to fail (for the reason explained above).

Here I'd say: the user has asked us to do an operation, we should keep trying until they cancel.

The user asked for an add/remove operation but may not be aware that it entails a potentially big fetch of shards, well beyond the ones needed for the original operation.

  • Following this advice we would remove the timeout and depend entirely on the client to decide when to stop the operation, the same way it had to even before the sharding feature, since an add/remove operation can always block if we don't find the necessary shards to operate on.

  • The advantage is the reduction of complexity as explained by Steven in the original comment.

  • The potential disadvantage is depending on a user-defined timeout that was designed for specific add/remove operations and not for these massive "extra" fetches that might noticeably exceed the ones originally needed.

If we're comfortable going this way, a simple 'go for it' is enough; no need to keep expanding on the motivation to remove the timeout.
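
For concreteness, a hedged sketch of the proposed shape (sizeBelowThreshold and switchToBasic are hypothetical names): no context.WithTimeout anywhere, so the caller's context alone bounds the "extra" fetches.

    package main

    import "context"

    // dirSizer abstracts the HAMT directory under evaluation (hypothetical).
    type dirSizer interface {
        sizeBelowThreshold(ctx context.Context) (bool, error)
        switchToBasic(ctx context.Context) error
    }

    func maybeSwitchToBasic(ctx context.Context, d dirSizer) error {
        // Block until the size relation is proven or the caller cancels;
        // ctx.Err() is the only early exit, keeping size/CID deterministic.
        below, err := d.sizeBelowThreshold(ctx)
        if err != nil {
            return err // includes ctx.Err() when the caller gave up
        }
        if below {
            return d.switchToBasic(ctx)
        }
        return nil
    }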

Maybe we could use `ResolveLink`

go-unixfs/io/resolve.go

Lines 15 to 28 in e3cca8a

    func ResolveUnixfsOnce(ctx context.Context, ds ipld.NodeGetter, nd ipld.Node, names []string) (*ipld.Link, []string, error) {
        switch nd := nd.(type) {
        case *dag.ProtoNode:
            fsn, err := ft.FSNodeFromBytes(nd.Data())
            if err != nil {
                // Not a unixfs node, use standard object traversal code
                lnk, err := nd.GetNodeLink(names[0])
                if err != nil {
                    return nil, nil, err
                }
                return lnk, names[1:], nil
            }

Maybe we could use ResolveLink from node.go (https://github.com/ipfs/go-merkledag/blob/master/node.go#L359) here.

WDYT? @schomatis
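
For reference, a sketch of what the fallback branch could collapse to, since ProtoNode.ResolveLink already consumes the first path segment and returns the remaining path:

    package main

    import (
        ipld "github.com/ipfs/go-ipld-format"
        dag "github.com/ipfs/go-merkledag"
    )

    // resolveOnce delegates the non-unixfs fallback to ResolveLink, which
    // performs the GetNodeLink(names[0]) lookup and returns names[1:] itself.
    func resolveOnce(nd *dag.ProtoNode, names []string) (*ipld.Link, []string, error) {
        return nd.ResolveLink(names)
    }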

Plan a transition of unixfs to go-ipld-prime

unixfs constitutes a major interpretation of IPFS data into concrete files and directories.
It currently makes use of go-merkledag, which exposes a primary dagservice interface.
In addition to this interface, elements of unixfs have made extensive use of the concrete protonode implementation used by merkledag (example).

It would be valuable to have this code transitioned to make use of a nascent DagStore providing similar functionality on go-ipld-prime. This will help to allow for continued performant and clean use of unixfs within the context of ipld-prime backed data fetching.

  • should FSNode continue to use the protobuf-generated struct implementation, or should we generate an equivalent ipld-prime node representation of that data?
  • enumerate the interfaces (e.g. a quick glance shows public methods like NewShard make use of a dag service) that would ideally transition from reliance on ipld-format/go-merkledag to prime equivalents (maybe you have started this, @gammazero?)
  • Find consumers within the IPFS org that would need to update for this, make sure that's reasonable, and come up with a work plan (ordering of package updates / releases) that is mutually agreeable.

cc @aschmahmann @Stebalien @gammazero

The context in `CtxReadFull` is not propagated to the child nodes being read

The context used during read operations is always passed to the child PB node and "bifurcated" with a context.WithCancel(ctx) call:

    func NewPBFileReader(ctx context.Context, n *mdag.ProtoNode, file *ft.FSNode, serv ipld.NodeGetter) *PBDagReader {
        fctx, cancel := context.WithCancel(ctx)
        curLinks := getLinkCids(n)
        return &PBDagReader{
            serv:     serv,
            buf:      NewBufDagReader(file.Data()),
            promises: make([]*ipld.NodePromise, len(curLinks)),
            links:    curLinks,
            ctx:      fctx,

I understand that the rationale behind this is that if I cancel a context at a certain depth in the DAG I'm reading, that is, if I cancel a fetch operation of a node (promise) in the DAG, it will cancel all the operations below that depth that are happening in their children.

That context being "bifurcated" always comes from the internal PBDagReader's context:

go-unixfs/io/pbdagreader.go

Lines 128 to 139 in 76afbe2

    func (dr *PBDagReader) loadBufNode(node ipld.Node) error {
        switch node := node.(type) {
        case *mdag.ProtoNode:
            fsNode, err := ft.FSNodeFromBytes(node.Data())
            if err != nil {
                return fmt.Errorf("incorrectly formatted protobuf: %s", err)
            }
            switch fsNode.Type() {
            case ftpb.Data_File:
                dr.buf = NewPBFileReader(dr.ctx, node, fsNode, dr.serv)
                return nil

If we're doing a Read call (which calls CtxReadFull with the internal context), the behavior is consistent with the rationale described above. But if the user manually calls CtxReadFull with a custom context, it is applied only to the fetch operations (precalcNextBuf) of the root node and is not used for the child nodes at deeper depths:

go-unixfs/io/pbdagreader.go

Lines 169 to 176 in 76afbe2

    func (dr *PBDagReader) Read(b []byte) (int, error) {
        return dr.CtxReadFull(dr.ctx, b)
    }

    // CtxReadFull reads data from the DAG structured file
    func (dr *PBDagReader) CtxReadFull(ctx context.Context, b []byte) (int, error) {
        if dr.buf == nil {
            if err := dr.precalcNextBuf(ctx); err != nil {

The result is that cancelling the custom context used in CtxReadFull won't cancel fetch operations (promises) at deeper levels. Is that the intended behavior?
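
For illustration only, a sketch (not a vetted patch) of what consistent propagation might look like: loadBufNode would take the caller's context and hand it to the child readers, so a custom context passed to CtxReadFull governs fetches at every depth.

    // Sketch: the signature change is hypothetical; the real fix may differ.
    func (dr *PBDagReader) loadBufNode(ctx context.Context, node ipld.Node) error {
        switch node := node.(type) {
        case *mdag.ProtoNode:
            fsNode, err := ft.FSNodeFromBytes(node.Data())
            if err != nil {
                return fmt.Errorf("incorrectly formatted protobuf: %s", err)
            }
            switch fsNode.Type() {
            case ftpb.Data_File:
                // was dr.ctx: the caller's context now reaches child fetches
                dr.buf = NewPBFileReader(ctx, node, fsNode, dr.serv)
                return nil
            }
        }
        // ... remaining cases as in the original
        return nil
    }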

Importers: DAGService should not be batched internally by default

The importers wrap the DAG service in an ipld.Batch that is hardcoded to 128 MaxNodes, because that's what works for flatfs.

I can see why in the context of go-ipfs optimizations, but it seems to me that at this point the importers should be agnostic to how the DAG service they use is configured/optimized. In particular, they should be able to use a non-batching DAG service, or one optimized differently.

My first impression is that they should probably use the DAGService as it comes (without wrapping it), and it would be the importer user's responsibility to call Commit once the importer has returned, if a batching DAG service was used in the first place.

I realized this while investigating some race conditions in cluster, where we use a custom DAGService. It turns out that having it automatically batched into batches of 128 does not make any sense for us.

I would like to go ahead and "fix" this (here and in go-ipfs) if the intuitions above are correct...
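
A hedged sketch of that division of labor (runImporter stands in for the real importer entry point): batching becomes an explicit caller choice, with the caller owning the final Commit.

    package main

    import (
        "context"

        ipld "github.com/ipfs/go-ipld-format"
    )

    func importWithCallerBatching(ctx context.Context, dserv ipld.DAGService, runImporter func(ipld.NodeAdder) error) error {
        // The caller opts in to batching; the importer never wraps anything.
        batch := ipld.NewBatch(ctx, dserv)
        if err := runImporter(batch); err != nil {
            return err
        }
        // Flush whatever the importer queued; skipping this loses nodes.
        return batch.Commit()
    }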

Trying to use in project, fails with go-merkledag errors

I included this project as a module, but when I tried to build it I was served with the following errors:

    # github.com/ipfs/go-merkledag
    ../../../../go/pkg/mod/github.com/ipfs/[email protected]/merkledag.go:83:16: format.ErrNotFound (type) is not an expression
    ../../../../go/pkg/mod/github.com/ipfs/[email protected]/merkledag.go:131:11: format.ErrNotFound (type) is not an expression
    ../../../../go/pkg/mod/github.com/ipfs/[email protected]/merkledag.go:148:15: format.ErrNotFound (type) is not an expression
    ../../../../go/pkg/mod/github.com/ipfs/[email protected]/merkledag.go:361:14: format.ErrNotFound (type) is not an expression
    ../../../../go/pkg/mod/github.com/ipfs/[email protected]/merkledag.go:374:14: format.ErrNotFound (type) is not an expression

[unsharding] Follow up on some low-priority fixes/renames/refactors

This issue tracks small things that need to be done to wrap up the unsharding effort (concentrated in #94) but should be taken care of at the end, as they are neither complex nor essential, and addressing them now might disrupt the current stack of PRs in development/review without adding much value.

Working in branch https://github.com/ipfs/go-unixfs/tree/schomatis/directory/unsharding-minor-issues.

"Impedance mismatch" between mode as specified in spec/UNIXFS.md and go's os.FileMode

TLDR proposal:

  • EITHER: add func (...) UnixPerms() os.FileMode { ... return os.FileMode(perms | os.ModeIrregular) }
  • OR: extend the UnixFS mode spec to represent os.FileMode with full fidelity

Long version:

The under-specification of mode to cover only file permissions presents a problem when combined with the interface proposed by @Stebalien in ipfs/kubo#6920 (comment).

At present within the go-ipfs codebase the only way to derive the type of a UnixFS entry is to query the containing IPLD-ish object. E.g. https://github.com/ipfs/go-unixfs/blob/v0.2.4/file/unixfile.go#L153-L180, with the returned file.Node being differentiated based on the "derived-interface" ( Directory, File, etc ).

With the introduction of an actual os.FileMode entry on such nodes, we are introducing an additional spot whose state needs to be kept synchronized. This is further exacerbated by the fact that the default os.FileMode isn't "unknown" but rather "regular file".

It would be much clearer and less error-prone to introduce the new function into the interface as UnixPerms() or some such, returning the required 12-bit value (which could be an os.FileMode value with ModeIrregular set). In the future, if we do end up expanding the on-wire data to contain the actual file mode, we can keep UnixPerms() backwards compatible by making it a proxy over the new, now-validated-always-correct FileMode().
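
A sketch of the first option (bit values per the usual unix mode layout; the mapping of setuid/setgid/sticky onto Go's distinct FileMode bits is exactly the impedance mismatch in question):

    package main

    import "os"

    // UnixPerms maps the spec'd 12-bit mode value onto os.FileMode bits, with
    // ModeIrregular set to signal that no file type claim is being made.
    func UnixPerms(mode uint32) os.FileMode {
        m := os.FileMode(mode & 0o777) // rwx bits map one-to-one
        if mode&0o4000 != 0 {
            m |= os.ModeSetuid
        }
        if mode&0o2000 != 0 {
            m |= os.ModeSetgid
        }
        if mode&0o1000 != 0 {
            m |= os.ModeSticky
        }
        return m | os.ModeIrregular
    }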

Thoughts?

Remove the internal `ProtoNode` from the `Shard` structure

Background: ipfs/boxo#387.

The Shard object (UnixFS layer) stores the proto-node (DAG layer) it was created from (NewHamtFromDag). It uses it as a kind of cache (mostly for its links, I'm guessing), but its state is not clear at all (to the point that the Node method doesn't even update it), so internal functions can easily misuse it.

I'm not sure what the ramifications of all this are, so the first step would be to remove it and replace it with a link slice copied from the original node (we were already copying the entire node in NewHamtFromDag, so this won't take a performance hit). This slice would just be a lazy-loading mechanism for the Shards; once a Shard is loaded we would no longer have use for the link, which could be nil'd (we need to check this).

I really want to avoid the kind of double manipulation of the shard state seen in insertChild, where we update both children and nd.Links. From now on, children would be the single source of truth for querying the shard state.

After this change we can evaluate actually having a cache in the form of a proto-node that would be updated in Node() to avoid re-marshaling everything every time the underlying DAG proto-node is needed (that is, if we modified a single entry in the HAMT directory, only update the respective nodes and not the entire trie).
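
A hypothetical sketch of the replacement (names illustrative; the load callback stands in for the existing NewHamtFromDag constructor):

    package hamt

    import (
        "context"

        ipld "github.com/ipfs/go-ipld-format"
    )

    type Shard struct {
        links    []*ipld.Link // copied from the original node, nil'd after use
        children []*Shard     // single source of truth once a slot is loaded
        dserv    ipld.DAGService
        load     func(ipld.DAGService, ipld.Node) (*Shard, error)
    }

    func (s *Shard) loadChild(ctx context.Context, i int) (*Shard, error) {
        if s.children[i] != nil {
            return s.children[i], nil // already materialized
        }
        nd, err := s.links[i].GetNode(ctx, s.dserv)
        if err != nil {
            return nil, err
        }
        child, err := s.load(s.dserv, nd)
        if err != nil {
            return nil, err
        }
        s.children[i] = child
        s.links[i] = nil // drop the lazy-load handle
        return child, nil
    }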

`Sync()`/`modifyDag()`: correctly handle offsets bigger than file size

see comment #5326

    if cur+bs > offset {
        // This child spans the write offset: descend into it and patch it.
        child, err := node.Links()[i].GetNode(dm.ctx, dm.dagserv)
        if err != nil {
            return cid.Cid{}, err
        }
        k, err := dm.modifyDag(child, offset-cur)
        if err != nil {
            return cid.Cid{}, err
        }
        // Point the parent at the rewritten child.
        node.Links()[i].Cid = k
        // Recache serialized node
        _, err = node.EncodeProtobuf(true)
        if err != nil {
            return cid.Cid{}, err
        }
        if dm.wrBuf.Len() == 0 {
            // No more bytes to write!
            break
        }
        offset = cur + bs
    }

Conform to licensing policy and link contributing guidelines in README

The license section should make sure to conform to the IPFS licensing policy by linking the actual license file.

We should also make sure to link to the contributing guidelines in the contributing section.

Side notes:

  • Is it correct that copyright is assigned to @jbenet here and not Protocol Labs, Inc.?
  • We don’t technically follow standard-readme here (e.g. no usage section, which is required). Should that note in the contributing section change?

State of UnixFS v2

Hey @Stebalien,

I couldn't find any recent document/discussion about UnixFS v2. I was wondering what the design looks like, as I read some research about another filesystem that was created to have extremely fast access at the folder level, uses hashes instead of b-trees, and is designed for very low code complexity.

Not sure if you've heard of it yet; it's SpadFS:

http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/

While the structure on disk doesn't really make sense here, the layout of the directories might be interesting to have a look at.

Here are some graphs comparing it to other filesystems under Linux; the directory-traversal and add/remove-files-in-a-directory performance in particular is quite impressive.

https://www.semanticscholar.org/paper/Design-and-Implementation-of-the-Spad-Filesystem-Patocka/2fc025b8bdd14b7457802795df40fbbfd967a407
