
delugefs

This fork is brought to a simmer over medium-low heat, and modified to taste for FreeBSD.

Overview

DelugeFS is a distributed filesystem that @keredson implemented as a proof of concept of several ideas he'd had over the past few years.

Key features include:

  • a shared-nothing architecture (no master node controls everything; all peers are equal, so there is no single point of failure)
  • auto-discovery to suggest local network peers
  • the ability to utilize highly heterogeneous computational resources (i.e. a random bunch of disks stuck in a random number of machines)

Key insights this FS proves:

  • Efficient distribution of large blocks of immutable data without centralized control is a solved problem: libtorrent.
  • Efficient sharing of small quantities of mutable data (without a central server) is a solved problem: Mercurial, Git, or Syncthing.
  • Automatic discovery of peers on a local network is a solved problem: Zeroconf/Bonjour.
  • Emulating a filesystem is a solved problem with FUSE.
  • All of these projects have Python bindings!
  • The key node in a distributed filesystem is the disk, not the machine. Everything above the disk is network topology.

This Fork's Planned Changes

  • try using pyfilesystem

  • split files into chunks similar to GridFS

  • If we use GridFS (Apache License), I think I have to update to GPLv3

  • what size should the file chunks be? investigate how BitTorrent chunks files

  • a multiple of your FS sector/stripe size?

  • make sure not to split into too many chunks or iops will be slow on split/reassemble

  • if we are transferring them, what size is good for a retry operation?

  • track chunk size for variable-size chunks, e.g. at end of file

  • store user desired filename as array of torrent hashes, save torrents in metadata folder, e.g. ./meta/torrents

  • track chunks we NEWly added using a metadata folder/file (so other nodes can find us if MetaSyncLayer and SyncLayer are different)

  • track chunks others HAVE using a metadata folder/file (so we can determine if we need to make another copy, in some sort of Reed-Solomon scheme)

  • track chunks we WANT using a metadata folder/file (because all nodes can't seed forever, we use this to dynamically add/remove seeding requests for chunks)

  • on read: verify the SHA256 before returning to FS

  • split up the mount logic and the sync logic so that root can mount and talk to an unprivileged MetaSyncLayer and SyncLayer

  • MetaSyncLayer could use syncthing (with recursive folder scan disabled) for metadata and chunks in separate repos,

  • or MetaSyncLayer syncthing for metadata and SyncLayer torrent for chunks (to grab from multiple sources)

  • can we run multiple instances of syncthing on one computer with different programroot folders? or do we have to do something like OpenZFS vdevs?

  • manually add external network peer (this should be handled by syncthing)

  • Firewall hole punching for git on peers behind NAT/firewalls (handled by syncthing and torrent dht)

  • remove the "keep_pushing" loop and let syncthing handle that

  • remove git (let syncthing handle conflicts)

  • autodiscover local network peers and suggest them in syncthing?

  • cross-datacenter (or "domain") and cross-node algorithm so we don't end up with 2 copies on 2 disks that are both in 1 node in 1 datacenter

  • if I/O across the FUSE boundary is still CPU limited around ~10MB/s, see if tmpfs/ramdisk will help
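Several of the chunking questions above can be sketched in a few lines. This is a minimal illustration, not the implementation: the 4 MiB chunk size is an assumed placeholder (the right size is still an open question above), and `split_into_chunks` is a hypothetical helper name.

```python
import hashlib

# Assumed placeholder -- the list above leaves the right chunk size open.
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB

def split_into_chunks(path):
    """Yield (sha256_hex, bytes) pairs for each chunk of a file.

    The final chunk may be smaller than CHUNK_SIZE, so chunk length
    must be tracked separately (e.g. in the file's index entry).
    """
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            yield hashlib.sha256(data).hexdigest(), data
```

A file's index entry would then be the ordered list of chunk hashes plus the total size, which is enough to reassemble the file and verify every piece.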

This Fork's Planned Underlying Tree Layout

/mnt/disk1/                                     # let's say this is mounted to ~/myfiles/synced/
    +- .git.meta/                               # separate-git-dir
    +- meta/                                    # was "gitdb", now MetaSyncLayer will handle syncing this across nodes
    |   +- .gitcluster.ssh/
    |   |   `- ...                              # gtfc specific cluster information
    |   |
    |   +- new/                                 # on a FS write, each node writes what chunk it has added here
    |   |   `- SHA256.torrent                   # newly added chunks, if not lazy, auto-add these torrents
    |   |
    |   +- want/                                # copy from chunks we want that seeders...
    |   |   `- SHA256.torrent                   # should auto-add if they have it
    |   |
    |   +- torrents/                            # master library of chunk metadata
    |   |   +- (SHA256 depth+width)             # split like git, to avoid limit of approx 32000 files in a folder
    |   |       `- SHA256.torrent
    |   |
    |   +- catalog/                                 # on 100% dl, append the chunk SHA to your catalog
    |   |   `- domain/                              # for cross datacenter algorithm
    |   |       `- hostname/                        # for cross node algorithm
    |   |           `- programroot-safefoldername   # use programroot folder name in case we want...
    |   |                                           # another copy on a different folder without mounting
    |   +- index/
    |       `- filename                         # traversing the FS, really just traverses this
    |
    +- chunks/                      # was "dat", used by FS and SyncLayer
    |   +- (SHA256 depth+width)     # split like git, to avoid limit of approx 32000 files in a folder
    |       `- SHA256               # the chunks, on read: verify the SHA256 before returning to FS
    |
    +- tmp/                 # stores data that is going to be written
        +- uuid.whole       # this is a whole file that needs to be chunked
        `- uuid.chunk       # this is a chunk of a file that we will SHA256, move to ./chunks/, then create a torrent

/mnt/disk2/         # let's say this one is not mounted
    +- meta/        # as above
    |   ...
    +- chunks/      # as above
        `-  ...
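The `(SHA256 depth+width)` fan-out above works like Git's object store. A sketch, assuming a depth of 1 directory level and a width of 2 hex characters (both values are placeholders, not decided):

```python
import os

def _fanout(sha256_hex, depth, width):
    # e.g. depth=1, width=2 turns "ab12..." into ["ab"]
    return [sha256_hex[i * width:(i + 1) * width] for i in range(depth)]

def chunk_path(root, sha256_hex, depth=1, width=2):
    """Map a chunk hash to its on-disk path, fanning out into
    subdirectories so no single folder holds ~32000 files."""
    return os.path.join(root, "chunks", *_fanout(sha256_hex, depth, width),
                        sha256_hex)

def torrent_path(root, sha256_hex, depth=1, width=2):
    """Same fan-out for the chunk's torrent metadata."""
    return os.path.join(root, "meta", "torrents",
                        *_fanout(sha256_hex, depth, width),
                        sha256_hex + ".torrent")
```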

This Fork's Implemented Changes

Tick if manually tested:

  • specify BitTorrent start port number with --port
  • Git instead of Mercurial
  • SSH prep-node script to bootstrap id_ed25519.pub and known_hosts
  • force SSH command in authorized_keys2 after "git push", "git merge tomerge" ... command="git-shell -c $SSH_ORIGINAL_COMMAND"
  • torrent.info.name should be the sha256 of the file instead of uuid.hex so we can dedup when this produces an identical torrent.info.hash

TODO, but no idea yet on how to implement

  • Find a better way to invalidate cache (instead of always invalidating it via sysctl)

Requirements

FreeBSD:

  • kldload fuse
  • pkg install py27-pybonjour mDNSResponder
  • pkg install py27-libtorrent-rasterbar libtorrent-rasterbar
  • pkg install fusefs-libs

And another way to sync the metadata, maybe Syncthing or gtfc:

  • pkg install syncthing

Current Status

HIGHLY EXPERIMENTAL! -- PROOF OF CONCEPT ONLY -- DO NOT USE FOR ANY CRITICAL DATA AT THIS POINT!

In 2013 @keredson was using it as personal media center storage spanning three disks on two machines. It works well so far, but it is still very early in development.

Speed:

  • I/O across the FUSE boundary is CPU limited. Max observed is ~10MB/s. @keredson suspects this is a limitation of the Python FUSE bindings.
  • I/O between nodes is limited by the disk read/write speeds. @keredson has observed >70MB/s sustained on his home network.

Known Issues

  • Files over ~4GB are not stored (and their zero-length stubs cannot be deleted). @keredson believes this is due to an int vs. long incompatibility with libtorrent, but hasn't confirmed.
  • hardlink and symlink fail
  • setting ctime/mtime on a file fails (may be fixed with syncthing)
  • setting the owner fails, but doesn't error
  • setting permissions fails, but doesn't error

Basic Algorithm

To start up:

  1. Filesystem is started given a volume id, a storage location, and a mount point.
  2. Filesystem searches for local peers.
  3. Filesystem either pulls from or clones other peers' repositories.
  4. Filesystem looks for any files it has locally (complete or not), and starts seeding them.
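Step 4 amounts to walking the chunk store and seeding whatever is found. A minimal sketch using the tree layout above (`find_local_chunks` is a hypothetical helper name; actually seeding via libtorrent is elided):

```python
import os

def find_local_chunks(root):
    """Return the set of chunk hashes present under root/chunks/,
    whatever fan-out depth was used. These are the chunks this node
    can start seeding immediately."""
    chunks = set()
    for dirpath, _dirnames, filenames in os.walk(os.path.join(root, "chunks")):
        for name in filenames:
            if len(name) == 64:  # looks like a SHA-256 hex digest
                chunks.add(name)
    return chunks
```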

To write a file:

  1. Filesystem client opens a file and attempts to write. Filesystem returns a handle to a local temporary file.
  2. Client writes to file and then closes it.
  3. Filesystem splits the file into chunks, each named by the secure hash of its contents; stores the filename as an array of chunk hashes plus the file size; creates a torrent for each chunk (containing metadata about the chunk along with secure hashes of its contents); and commits it all to a local repository.
  4. Filesystem notifies the sync layer that there are new chunks.
  5. MetaSyncLayer synchronizes the metadata. SyncLayer seeds chunks.

To read a file:

  1. If filesystem already has a copy of the file requested it returns the data directly.
  2. Filesystem loads the torrent file and starts searching for a peer with the file data via BitTorrent's distributed hash table (DHT) peer discovery mechanism.
  3. Filesystem waits for the blocks covering the read request to become available, and then returns the data to the filesystem client.
  4. Repeat per chunk.
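The "verify the SHA256 before returning to FS" rule from the planned-changes list applies to every chunk read. A sketch (`ChunkCorrupt` is a hypothetical exception name):

```python
import hashlib

class ChunkCorrupt(Exception):
    """Raised when a chunk's content does not match its name."""

def read_chunk(path, expected_sha256):
    """Read a chunk from local storage and verify it before handing
    the data back across the FUSE boundary."""
    with open(path, "rb") as f:
        data = f.read()
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        # caller could discard the chunk and re-fetch it via BitTorrent
        raise ChunkCorrupt(path)
    return data
```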

To replicate a file:

  1. All peers participate in the BitTorrent swarms associated with each file they have local copies of.
  2. If a peer notices that a chunk's replica count in the catalog falls below a threshold, it sends out clone requests to a subset of its peers until the count recovers.
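Counting replicas against the catalog layout above might look like the following sketch. It assumes each `catalog/domain/hostname/programroot` entry is a file listing chunk hashes one per line, and `REPLICA_TARGET` is an assumed constant; distinct (domain, host) pairs are counted so two copies on one node don't count twice.

```python
import os

REPLICA_TARGET = 2  # assumed minimum number of copies on distinct nodes

def replica_count(meta_root, sha256_hex):
    """Count distinct (domain, hostname) pairs whose catalog entries
    report this chunk."""
    holders = set()
    catalog = os.path.join(meta_root, "catalog")
    if not os.path.isdir(catalog):
        return 0
    for domain in os.listdir(catalog):
        for host in os.listdir(os.path.join(catalog, domain)):
            hostdir = os.path.join(catalog, domain, host)
            for programroot in os.listdir(hostdir):
                with open(os.path.join(hostdir, programroot)) as f:
                    if sha256_hex in (line.strip() for line in f):
                        holders.add((domain, host))
    return len(holders)

def needs_replication(meta_root, sha256_hex):
    """True if we should ask peers to make another copy."""
    return replica_count(meta_root, sha256_hex) < REPLICA_TARGET
```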

Example Usage

The first time the first node is ever brought up:

server1$ delugefs bigstore tank/delugefs --create

All future invocations would omit the "--create".

To bring up an additional node on a different disk on the same machine:

server1$ ./delugefs.py --cluster bigstore \
    --root /mnt/disk2/.bigstoredb

(note the lack of the optional mount point)

To bring up an additional node on a different machine:

server2$ delugefs bigstore tank/delugefs --lazy --btport 6881

The --lazy option means data will only be transferred during a READ, instead of in the background.

That's all there is to it!

Why not ... ?

Couchbase + CBFS + FUSE

  • RAM requirements too high and XDCR encryption only in Enterprise.
  • No sync between different platforms.

MongoDB + GridFS + FUSE

  • Do you really want to run a Mongo cluster?
  • Not sure if it can sync between different platforms.

CouchDB + couchdb-fuse

  • The db format for CouchDB is one file /var/db/dbname.db which is not great for large binary data.
  • That db file would require compaction.
  • Replication is kind of peer-to-peer, but not per-chunk per peer.

Contributors

@johnko and @keredson
