DPFS - DPU-Powered File System Virtualization framework

The DPFS framework allows cloud and datacenter operators to provide virtualized file system services to tenants using DPU offloading. With DPFS, the complete file system implementation runs on the CPU complex of the DPU. Tenants consume the file system through the virtio-fs device that the DPU exposes over PCIe (multi-tenancy via SR-IOV). DPFS provides a hardware abstraction layer, a FUSE API implementation, and several file system implementations.

Warning: DPFS is currently a research project; its code is therefore neither battle-tested nor very clean. Use at your own discretion.

Research Publications

Design and implementation

DPU virtio-fs architecture diagram

Modules

dpfs_hal

Front-end and hardware abstraction layer for the virtio-fs emulation layer of the DPU hardware. Currently it only supports the Nvidia BlueField-2; support for other vendors is in the works. We have worked together with other DPU vendors to make sure our framework architecture/API is compatible with future virtio-fs support for other DPUs.

dpfs_fuse

Provides a lowlevel FUSE API (close-ish compatible fork of libfuse/fuse_lowlevel.h) over the raw buffers that DPUlib provides the user, using dpfs_hal. If you are building a DPU file system, use this library.

dpfs_nfs

Reflects an NFS folder with the asynchronous userspace NFS library libnfs by implementing the lowlevel FUSE API in dpfs_hal. The full NFS connect handshake (RPC connect, setting the clientid and resolving the filehandle of the export path) is currently implemented asynchronously, so wait for dpfs_fuse to report that the handshake is done before starting a workload!

The NFS server needs to support NFS 4.1 or greater! Because the current release version of libnfs does not yet fully implement NFS 4.1 (and has no polling timeout), this new version of libnfs is needed, which implements the missing functionality.

dpfs_kv

Reflects the contents of a RAMCloud cluster as a flat root directory to the host machine. The key is the name of the file in the root directory and the value is the contents (4k max file size) of the file. This backend is optimized for low latency for many small files through RDMA.
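The mapping described above (file name as key, file contents as value, 4k limit) can be pictured with a small in-memory model. This is only a sketch: the class and method names are hypothetical, a Python dict stands in for the RAMCloud cluster, and "4k" is assumed to mean 4096 bytes.

```python
MAX_FILE_SIZE = 4096  # the 4k limit stated above, assumed to be 4096 bytes


class FlatKvDirectory:
    """A flat root directory in which file name == key and contents == value."""

    def __init__(self):
        # Hypothetical in-memory stand-in for the remote key-value cluster.
        self._store = {}

    def write(self, name: str, data: bytes) -> None:
        if "/" in name:
            raise ValueError("flat namespace: subdirectories are not supported")
        if len(data) > MAX_FILE_SIZE:
            raise ValueError("value exceeds the 4k maximum file size")
        self._store[name] = bytes(data)

    def read(self, name: str) -> bytes:
        return self._store[name]  # raises KeyError for a missing file

    def readdir(self) -> list:
        # Listing the root directory is just listing the keys.
        return sorted(self._store)
```

Because the namespace is flat, a lookup is a single key fetch, which is what makes this layout attractive for many small files.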

dpfs_aio

Reflects the contents of a file system that is mounted locally on the DPU. Metadata operations are synchronous, and R/W I/O is performed asynchronously using libaio.

dpfs_uring

Same as dpfs_aio but the R/W I/O uses io_uring. See the conf_example.toml for extra io_uring options.

list_emulation_managers

Standalone program to find out which RDMA devices have emulation capabilities.

Usage on the Nvidia BlueField-2

The Nvidia SNAP library that is needed to run on BlueField-2 (only DPU currently supported) is closed source and does require a patch to enable asynchronous request completion. Using virtio-fs in SNAP is currently only possible with a prototype firmware and some alterations to the SNAP library. You can reach out to us on how to integrate DPFS and SNAP.

The steps to set up the DPU:

  • Install BFOS DOCA 3.9.3 (Ubuntu 20.04); newer versions might very well work. All the following steps are performed on the DPU.
  • Uncomment the first deb-src from /etc/apt/sources.list and execute sudo apt update; sudo apt build-dep linux.
  • Upgrade all the packages using apt, except for the Open vSwitch packages, which should be held back with apt-mark hold (currently the latest release is broken on the BF2).
  • Download MLNX OFED v23.04-1.1.3.0 (the latest that we tested) on the DPU, and add it as an apt repository (see these docs).
  • Only install mlnx-ofed-kernel-only using apt.
  • Clone the Linux source and check out v6.2 (the latest stable tag).
  • Copy the /boot/config file of the most recent kernel to linux/.config, and enable all MLX drivers and any other drivers you might want (e.g. Ceph). The config we used for the thesis experiments can be found in linux_patches/.
  • Compile Linux using make bindeb-pkg -j 7 and install the image, headers and libc using dpkg.
  • Reboot the DPU and confirm via ip addr that the ovs bridges are all there.
  • Flash the prototype firmware using mlxburn.

Why all this pain to upgrade the Linux kernel? Because by default the BlueField OS runs Linux 5.4, which was released on 24 November 2019. Newer kernel versions (6.2 in this case) improve the performance of networking, io_uring, and more. They also give you kernel modules that are not in BF OS by default, such as Ceph. That said, DPFS should fully work on Linux 5.4 as well.

With the above in mind, the steps needed to run DPFS on the BlueField-2:

  • Install the following deps on the DPU: autoconf cmake binutils libtool libck-dev libboost-thread-dev numactl
  • Patch SNAP to add a virtio-fs device type called "virtiofs_emu"
  • Patch SNAP to support asynchronous completion of virtio-fs requests (needs to be concurrency-safe)
  • Integrate DPFS into the build system of SNAP
  • Enable virtio-fs emulation in the DPU firmware with at least one physical function (PF) for virtio-fs, and reboot the DPU
  • Determine the RDMA device that has virtio-fs emulation capabilities by running list_emulation_managers
  • Use one of the file system implementations by configuring DPFS through the toml configuration file (see conf_example.toml)

Setup steps for the extern dependencies:

  • Inside DPFS: git submodule init; git submodule update --init --recursive
  • Inside extern/eRPC-arm: cmake . -DPERF=on -DTRANSPORT=infiniband -DROCE=on; make -j
  • Inside extern/libnfs: ./bootstrap; CFLAGS=-O3 ./configure --enable-pthread; make -j; sudo make install; sudo rm /etc/ld.so.cache; sudo ldconfig

Project status

See the GitHub issues and milestones.

FAQ

What is a 'DPU'?

A DPU (Data Processing Unit), for the scope and definition of this project, contains a CPU (running e.g. Linux), a NIC and programmable data acceleration engines. It is also commonly referred to as a SmartNIC or IPU (Infrastructure Processing Unit).

DPUs are shaping up to become the center of virtualization in the cloud. By offloading cloud operator services such as storage, networking and cloud orchestration to the DPU, the load on the host CPU is reduced, cloud operators gain more control over their services (for upgrades and performance optimizations), and bare metal tenants are easier to support.

What is virtio-fs?

Virtio is an abstraction layer for virtualized environments to expose virtual PCIe hardware to guest VMs. Virtio-fs is one of these virtual PCIe hardware specifications. It employs the FUSE protocol (only the communication protocol!) to provide a file system to guest VMs. DPUs with support for hardware-accelerated virtio-fs emulation are now coming onto the market, which means that a real hardware device implements the virtual file system layer of virtio. We are using the Nvidia BlueField-2, which supports virtio-fs emulation using Nvidia SNAP (currently only available as a limited technical feature preview).

Contact and Credits

  • 🇨🇭 Hybrid Cloud / Infrastructure Software group at IBM Research Zurich
  • 🇳🇱 StoNet-research at VU Amsterdam

For contact about DPFS and the research we are conducting, please reach out to: Peter.Jan.Gootzen at ibm d0t c0m. If you are a DPU vendor looking into supporting file system offloading on your DPU, we would be happy to help with porting DPFS to your hardware.

dpfs's People

Contributors

imgbotapp, pepperjo, peter-jangootzen

dpfs's Issues

Multiple NFS connections

Currently the virtionfs implementation uses a single Virtio queue polling thread and a single NFS socket polling thread (called the service thread in libnfs).
The goal is to move to 8 Virtio queue polling threads and 8 NFS socket polling threads. Eight of each is probably optimal because our DPU has 8 cores, and of these 16 threads only 8 would need to run at a time (either sending or receiving).
To do this multiple NFS connections are needed, which is supported in NFS 4.1. This feature is called 'session trunking'.

NFS random verifier and clientid

Currently virtionfs uses a non-random verifier and a non-random clientid. This is unreliable and leads to collisions when multiple clients connect to the same server.
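One possible way to generate both values is sketched below. It assumes the 8-byte verifier4 size from the NFSv4 specifications; the function names, the "dpfs" prefix and the owner id layout are illustrative assumptions, not the format virtionfs uses. Whether the id should be fully random or stay stable across restarts (so the server can recognize a rebooted client) is a design choice this sketch does not settle.

```python
import os
import socket
import uuid

NFS4_VERIFIER_SIZE = 8  # verifier4 is an 8-byte opaque value in NFSv4


def new_verifier() -> bytes:
    """Return a fresh random 8-byte verifier for this client instance."""
    return os.urandom(NFS4_VERIFIER_SIZE)


def new_client_owner_id(prefix: str = "dpfs") -> bytes:
    """Return an owner id that differs per host and per client instance."""
    # The prefix and the "prefix:host:uuid" layout are hypothetical.
    return f"{prefix}:{socket.gethostname()}:{uuid.uuid4().hex}".encode()
```

Including the host name keeps the id readable in server-side logs, while the random suffix avoids collisions between several virtio-fs devices served from the same DPU.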

Rewrite fuser to use io_uring

Currently fuser is incredibly slow in metadata operations because they all occur synchronously, blocking the Virtio queue (and the thread that is currently polling) while the remote operation is outstanding and thus incurring huge latencies.
Apparently io_uring supports metadata operations, so this would be a good fit.

Comprehensive experimentation suite

Host NFS vs DPU NFS

  • IOPS
  • Latency statfs and read
  • Bandwidth
  • Metadata workload
  • CPU utilization per IOP

Key-Value implementation

  • Metadata workload
  • IOPS
  • Bandwidth

Metadata workload blocked by

Some metadata FUSE functions are not compatible with async

These functions will create a struct on the stack that the FUSE implementation has to fill in and then call a special reply function such as fuse_ll_reply_attr. However with async these stack-residing structs will get destroyed.
Broken functions to fix:

  • setattr
  • readdir and readdirplus
  • create
  • flush (or just don't use the struct fuse_file_info *, then it's fine)
  • flock

`io_uring` feature tracker

  • R/W I/O
  • Completion queue polling (i.e. userspace-side polling)
  • getattr
  • fsync
  • Fixed buffers
  • Fixed files
  • Submission queue polling (i.e. kernel-side polling)

Rust bindings

If this project is to be used for more long-term projects, it should move away from C to something safer but still low-level, like Rust. The frequency of memory-related bugs and undefined behavior is very high. The DPU library is in C, so two approaches are possible.

  • Create Rust bindings for the DPU library
  • Create Rust bindings for the lowlevel dpu-virtio-fs library and then reimplement the FUSE facilities in Rust

TODO

  • Figure out how to create Rust bindings for an automake project

Performance of virtionfs

Currently with XLIO sequential write performance with bs=4k, iodepth=16, numjobs=1 is ~244MB/s, while sequential read only gives ~20MB/s.

  • Investigate why this is happening (there are already some flamegraphs from runs on commit 7fc9e18)
  • Implement a fix

Furthermore blocksizes larger than the page size (4k) don't do anything (i.e. no performance increase).

  • Investigate why
  • Implement a fix
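To put the two bandwidth figures above into perspective, they can be converted to IOPS and to an approximate per-request latency with Little's law. This is a back-of-the-envelope sketch that assumes decimal megabytes, bs=4k meaning 4096 bytes, and a queue that always holds the full iodepth of 16 requests.

```python
BLOCK_SIZE = 4096   # bs=4k, assumed to mean 4096 bytes
IODEPTH = 16        # assumed to be fully used at all times
MB = 1_000_000      # assumes decimal megabytes


def iops(bandwidth_mb_s: float) -> float:
    """Requests per second needed to reach the given bandwidth."""
    return bandwidth_mb_s * MB / BLOCK_SIZE


def mean_latency_us(bandwidth_mb_s: float) -> float:
    """Little's law: outstanding requests = throughput * latency."""
    return IODEPTH / iops(bandwidth_mb_s) * 1e6


for name, bw in (("write", 244), ("read", 20)):
    print(f"{name}: {iops(bw):.0f} IOPS, {mean_latency_us(bw):.0f} us per request")
```

Under these assumptions the write path sustains roughly 60k IOPS at about 0.27 ms per request, while the read path manages roughly 4.9k IOPS at about 3.3 ms per request.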

NFS lease period not implemented

In NFS the filehandle returned by OPEN is only valid for the lease period (attribute lease_time, which is in seconds). Currently this is not taken into account, so requests start failing once the lease period has ended.
https://www.rfc-editor.org/rfc/rfc5661#section-5.8.1.11

It is likely that this is causing the 10025 (NFS4ERR_BAD_STATEID) and 10020 (NFS4ERR_NOFILEHANDLE) NFS errors that occur some time during the workloads.
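A minimal sketch of the bookkeeping such a fix would need is shown below. The class and its renew-at-half-the-lease margin are assumptions made for illustration; in NFS 4.1 the lease is renewed by a successful SEQUENCE operation, so renewed() would be called whenever one completes.

```python
import time


class LeaseTracker:
    """Tracks when the NFS lease must be renewed to keep client state valid."""

    def __init__(self, lease_time_s: float, clock=time.monotonic, margin: float = 0.5):
        self.lease_time_s = lease_time_s  # value of the lease_time attribute
        self.margin = margin              # renew after this fraction of the lease
        self._clock = clock               # injectable so the logic can be tested
        self._last_renewal = clock()

    def renewed(self) -> None:
        """Call after any operation that renews the lease."""
        self._last_renewal = self._clock()

    def needs_renewal(self) -> bool:
        elapsed = self._clock() - self._last_renewal
        return elapsed >= self.lease_time_s * self.margin

    def expired(self) -> bool:
        return self._clock() - self._last_renewal >= self.lease_time_s
```

A polling thread could check needs_renewal() on each iteration and send a cheap request when it returns true, so that an idle workload does not silently lose its state.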

Virtio-fs Linux driver does not support multi-queue

Problem
https://elixir.bootlin.com/linux/latest/C/ident/virtio_fs_wake_pending_and_unlock

In the virtio_fs_wake_pending_and_unlock function, the queue on which the request will be put is hardcoded to a single virtio-fs request queue.

Execution
Development is happening here: https://github.com/Peter-JanGootzen/linux

TODO:

  • Select queue depending on CPU id
  • Fix needing to use get_cpu (smp_processor_id seems not to be allowed in tasks, but get_cpu disables preemption)
  • Set interrupt affinity
  • Simple cat test (only runs cat on a single core)
  • Big Ubuntu fio test (multi-core, check whether it actually goes into different queues with debugfs)
  • CPU Hotplugging (see virtio-net)
  • Implement round-robin queue scheduling (using atomics) in virtio-fs for comparison
  • Send it in as a formal patch (not a priority, but would be good for the paper)
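The two queue selection policies from the list above (per-CPU and round-robin) can be compared with a small model. This is a Python sketch of the selection logic only, not kernel code: the names are hypothetical and a lock stands in for the atomic counter a driver would use.

```python
import itertools
import threading


def queue_for_cpu(cpu_id: int, num_queues: int) -> int:
    """Per-CPU policy: requests from one CPU always use the same queue."""
    return cpu_id % num_queues


class RoundRobinSelector:
    """Round-robin policy: spread consecutive requests over all queues."""

    def __init__(self, num_queues: int):
        self._num_queues = num_queues
        self._counter = itertools.count()
        self._lock = threading.Lock()  # stands in for an atomic increment

    def next_queue(self) -> int:
        with self._lock:
            return next(self._counter) % self._num_queues
```

The per-CPU policy needs no shared state and keeps a request on the submitting core, whereas round-robin balances load evenly at the cost of a shared counter that every submitter touches.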

FUSE:init is not async

This is a big problem because virtionfs needs to execute asynchronous handshake requests in the init.
Currently it just sends the INIT FUSE completion before those are done. This can result in race conditions on boot.

An attempt to fix this was made in this commit; however, it caused random operations to break. It seems that either there is a timeout on FUSE:init or that commit was triggering UB.

Investigate possible uninitialized memory with NFS operations

The NFS operations (nfs_argop4) that are created for every NFS request are not zeroed out, so they might contain uninitialized memory. This is a very easy mistake to make, as some of the operations have very complex RPC structures. These need to be zeroed going forward.

Currently not able to receive NFS callbacks

Because of limitations in libnfs and the absence of a need for NFS callbacks support, virtionfs currently doesn't support anything to do with NFS callbacks. NFS callbacks allow the server to notify or ask the client (of) something, mainly used for cache invalidation and delegations (when a file resides on a client). See the NFS 4, 4.1 and 4.2 RFCs for all the callbacks.

NFS connection timeout

The NFS RPC connection seems to disconnect after a period of inactivity. One option is to periodically send an NFS:NULL to keep the connection alive. Another might be rpc_set_autoreconnect of libnfs (not sure what it does).
This might be non-trivial because the current architecture has no good place for something periodic like this.

Investigate fuser problems

  • fio rand_iops benchmark returns a -14 error
  • src_ino not used, or used wrongly?
  • inode not being properly locked (especially during the inode table operations)

Implement missing file system operations in virtionfs

These are all non-essential for experiments:

  • create
    • Send-side programmed but untested, no receive-side
  • setattr
  • flock
  • readdir and readdirplus
  • mkdir
  • rmdir
  • mknod
  • symlink
  • unlink
  • fallocate
  • rename
  • forget and batch_forget

UID and GID not applied on a per-request basis

Currently the UID and GID that are supplied to FUSE:init are used throughout the whole lifetime of the virtio-fs device (fuser and virtionfs).

However, each individual FUSE request contains a UID and GID, so each request should be executed with that request's UID and GID.
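The per-request credentials live in the header that starts every FUSE request. The sketch below reads them from a raw buffer, assuming the 40-byte struct fuse_in_header layout from the Linux FUSE protocol and little-endian encoding; the RequestCreds type and request_creds function are hypothetical names for illustration.

```python
import struct
from typing import NamedTuple

# Layout of struct fuse_in_header: len, opcode (u32), unique, nodeid (u64),
# uid, gid, pid (u32) and 4 trailing bytes that are skipped; 40 bytes in total.
FUSE_IN_HEADER = struct.Struct("<IIQQIII4x")


class RequestCreds(NamedTuple):
    opcode: int
    uid: int
    gid: int
    pid: int


def request_creds(buf: bytes) -> RequestCreds:
    """Extract the caller's identity from the start of a raw FUSE request."""
    _len, opcode, _unique, _nodeid, uid, gid, pid = FUSE_IN_HEADER.unpack_from(buf)
    return RequestCreds(opcode, uid, gid, pid)
```

A backend could look up these values for every request and pass them on to the remote file system instead of reusing the identity seen at FUSE:init.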

Help needed: Prototype firmware for DPFS in BF2

I posted my question in the discussions a couple of days ago; I think I should move it here for visibility.

Hello,

Currently we are configuring our own environment to evaluate DPFS by following the instructions on GitHub (DPFS/README.md at master · IBM/DPFS · GitHub). According to the document, a prototype firmware is required for the whole system to work properly. Would you please provide more details about this firmware? Is it developed by NVIDIA specifically for the DPFS project? How can we obtain the binary?

Thank you.

Use `CLAIM_FH` for NFS:OPEN instead of inefficient `CLAIM_NULL`

Currently when opening a file CLAIM_NULL is used in conjunction with CURRENT_FH=parent->fh and the filename of the file to be opened. This requires tracking the parents of files plus the filenames of files inside of virtionfs.

In NFS 4.1 there is a CLAIM_FH flag that allows you to open the file that the CURRENT_FH points to, which removes this extra bookkeeping overhead from the client.

Blocked by #2 and #15 because NFS 4.1 is required.
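The difference between the two claim types can be seen by writing out the operations each open would send. This is a schematic sketch: operations are modelled as plain tuples, the function names are hypothetical, and the SEQUENCE operation that starts an NFS 4.1 compound is left out for brevity.

```python
def open_with_claim_null(parent_fh: bytes, filename: str) -> list:
    """OPEN by name: the client must remember the parent handle and the name."""
    return [
        ("PUTFH", parent_fh),              # CURRENT_FH = parent directory
        ("OPEN", "CLAIM_NULL", filename),  # open <filename> inside it
        ("GETFH",),                        # fetch the handle of the opened file
    ]


def open_with_claim_fh(file_fh: bytes) -> list:
    """OPEN by handle (NFS 4.1): only the file's own handle is needed."""
    return [
        ("PUTFH", file_fh),                # CURRENT_FH = the file itself
        ("OPEN", "CLAIM_FH"),              # open whatever CURRENT_FH points to
    ]
```

With CLAIM_FH the client only has to keep the filehandle it already received from a LOOKUP, so the parent and filename tracking described above is no longer needed.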
