
easynic's Introduction

EasyNIC: an easy-to-use host interface for network cards

EasyNIC is the specification of a hacker-friendly interface between a computer and a network interface card (NIC).

Design goals:

  • Make device driver development easy.
  • Provide a minimal "high-speed serial port" operating mode.
  • Support 100G and beyond (use PCIe bandwidth efficiently.)
  • Support optional extensions (in later versions.)

EasyNIC is inspired by the success of RISC-V.

Streaming Transmit and Receive

The EasyNIC "Streaming Transmit and Receive" interface provides bulk packet transfer between the host and the network. The interface is modeled on a serial port where a continuous stream of bytes is transferred. Framing markers to delimit packets are sent as in-band length prefixes.

The host provides one buffer for transmit and one for receive. Packets are prefixed with a 16-bit length header and stored back-to-back. The buffers are rings of bytes, i.e. they automatically wrap around once the last byte has been used. The buffers are treated as contiguous byte streams and no space is wasted on alignment.

The host writes its updated cursor positions to the device via registers. The device writes its updated cursor positions to the host using DMA. (The host does not read the device state via registers because that would require a high-latency blocking read over PCIe.)

Note: The Ethernet FCS is automatically added and removed by the device. If an incoming frame has an invalid FCS then it is automatically dropped.
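To make the byte-ring framing concrete, here is a minimal sketch of a transmit-side enqueue in C. The length-prefix format and byte-wise wraparound follow the description above; the struct and function names are hypothetical, not part of the spec.

```c
#include <stdint.h>

/* Hypothetical driver-side view of the transmit ring. */
struct tx_ring {
    uint8_t *buf;     /* start of the ring (TX_START) */
    uint64_t size;    /* total size in bytes (TX_SIZE) */
    uint64_t cursor;  /* host cursor: next byte to write (TX_CURSOR) */
};

static void ring_put(struct tx_ring *r, uint8_t byte)
{
    r->buf[r->cursor] = byte;
    r->cursor = (r->cursor + 1) % r->size; /* byte-wise wrap, no padding */
}

/* Store one packet as a 16-bit little-endian length prefix followed
 * by the payload, back-to-back with the previous packet. Returns the
 * new host cursor to be written to the TX_CURSOR register. */
uint64_t tx_enqueue(struct tx_ring *r, const uint8_t *pkt, uint16_t len)
{
    ring_put(r, len & 0xff); /* in-band framing: length prefix */
    ring_put(r, len >> 8);
    for (uint16_t i = 0; i < len; i++)
        ring_put(r, pkt[i]);
    return r->cursor;
}
```

A real driver would first check free space against the device's DMA-published transmit cursor, and would batch the TX_CURSOR register write over many packets; both are omitted here.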

Registers:

TX_START [write]
RX_START [write]

  The address of the first byte of the buffer.

TX_SIZE [write]
RX_SIZE [write]

  The total size of the buffer in bytes.

TX_CURSOR [write]
RX_CURSOR [write]

  The current buffer position from the host perspective i.e. the
  offset from the start of the buffer at which the next packet
  will be written (transmit) or read (receive).

TXRX_STATUS [write]

  The address where the device will store up-to-date hardware
  transmit and receive cursor positions in host memory.

  The device writes a 16-byte record to this address, consisting
  of the 64-bit transmit cursor followed by the 64-bit receive
  cursor. The cursors specify the offsets from the start of the
  buffer where the next packet will be fetched (transmit) and
  stored (receive).

  The device is expected to update this record as rapidly as
  possible without compromising overall throughput.

TXRX_RESET [write]

  Write the value 1 to initiate an asynchronous reset of the
  device transmit/receive state. This potentially discards the
  contents of enqueued blocks.

  Completion of the reset can be detected via TXRX_READY. Upon
  completion the value of TX_BLOCK_AVAIL will have been reset to
  its initial value.

TXRX_READY [read]

  Read whether the device is ready to transmit and receive data.

  The value is determined strictly as follows:

  - Initialized to 0 on startup.
  - Transitions to 1 when ready.
  - Transitions back to 0 when a reset is requested.

  The host driver should therefore poll for a non-zero value during
  initialization and after each asynchronous reset.
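The reset/ready handshake above can be sketched as a small driver routine. MMIO access is faked here with a tiny simulated device so the logic is self-contained and runnable; the register offsets, helper names, and simulated delay are invented for illustration.

```c
#include <stdint.h>

enum { TXRX_RESET, TXRX_READY }; /* register identifiers (made up) */

static uint32_t fake_ready;       /* simulated TXRX_READY: 0 on startup */
static int      fake_reset_delay; /* polls until simulated reset completes */

static void reg_write(int reg, uint32_t val)
{
    if (reg == TXRX_RESET && val == 1) {
        fake_ready = 0;       /* READY drops to 0 when reset is requested */
        fake_reset_delay = 3; /* device becomes ready again after a while */
    }
}

static uint32_t reg_read(int reg)
{
    if (reg == TXRX_READY && fake_ready == 0 && fake_reset_delay-- == 0)
        fake_ready = 1;       /* transition to 1 when ready */
    return fake_ready;
}

/* Spec behavior: request an asynchronous reset, then poll
 * TXRX_READY for a non-zero value. Returns the number of polls
 * that were needed, purely for illustration. */
int txrx_reset_and_wait(void)
{
    int polls = 0;
    reg_write(TXRX_RESET, 1);
    while (reg_read(TXRX_READY) == 0)
        polls++;
    return polls;
}
```

In a real driver `reg_read`/`reg_write` would be MMIO accesses into the device's BAR, and the poll loop would include a timeout.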

Receive

Diagnostics

easynic's People

Contributors

lukego

easynic's Issues

Benchmark setup for validating design

We will need a realistic and accessible benchmarking setup to validate the design. For example we need to be able to experimentally work out what special accommodations the DMA design needs to make to the CPU regarding alignment etc (see #9).

How to do it? Here are a few ideas from hardest / most realistic downwards:

  1. Implement a real EasyNIC on a high-end 100G FPGA.
  2. Implement a fake EasyNIC on an Amazon F1 instance FPGA. This could implement the DMA engine and e.g. automatically loopback TX to RX. This FPGA has 12Gbps of PCIe bandwidth and no network connection (IIUC.)
  3. Just fake it on the CPU e.g. assume that reading from L3 cache has equivalent performance to reading from a NIC and implement registers e.g. via SIGSEGV signal handlers.

The last one seems very convenient. Has any research been done (e.g. to the PMU level) about how well L3 cache (e.g. array too large to fit into L2) works as a proxy for freshly DMA'd data on x86? (Maybe you guys have looked at this @emmericp?)

Avoid individual descriptors

One possible consequence of optimizing PCIe efficiency (#3) is to avoid using individual descriptors for each packet that is transmitted and received.

If consecutive packets are streamed to and from large memory buffers then two bytes of per-packet metadata may be sufficient e.g. to indicate the packet length and the Ethernet FCS validity.

This would be considerably more streamlined than the Intel and Mellanox approaches that typically require between 16 and 48 bytes of metadata for each packet. These scatter-gather designs are burdened with transferring the 64-bit address of each packet's individual buffer(s) and often with other non-essential metadata.
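As a sketch, the proposed two bytes could be split into a 15-bit length (ample for jumbo frames) and a 1-bit FCS-valid flag. This particular bit layout is an assumption for illustration; the issue only says two bytes may be sufficient.

```c
#include <stdint.h>

/* Hypothetical two-byte per-packet metadata word:
 * bits 0-14 = packet length, bit 15 = Ethernet FCS valid. */
static uint16_t meta_pack(uint16_t len, int fcs_ok)
{
    return (uint16_t)((len & 0x7fff) | (fcs_ok ? 0x8000 : 0));
}

static uint16_t meta_len(uint16_t meta)    { return meta & 0x7fff; }
static int      meta_fcs_ok(uint16_t meta) { return (meta >> 15) & 1; }
```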

Define the Receive interface

Has to be really easy to interface with from a driver, really easy to implement in silicon, and really efficient with PCIe bandwidth. (Like the Transmit interface, #1.)

Consider adding TX_BLOCK_SIZE register

The initial Transmit design (#5) expects the NIC to fetch a variable-length buffer with DMA. This seems likely to be awkward and inefficient on the silicon side since the device will need to speculatively read ahead and scan for the terminating zero-length marker.

Consider adding a register TX_BLOCK_SIZE where the host writes the exact size of the next block before its address is written to TX_BLOCK_SEND. This way the device can always fetch exactly the memory containing the block. (The block would not need the zero-length-terminator anymore, either.)
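The proposed sequence might look like this from the driver side. The register offsets and the `reg_write64` helper are made up, and a fake register file stands in for MMIO so the sketch is runnable.

```c
#include <stdint.h>

enum { TX_BLOCK_SIZE = 0x10, TX_BLOCK_SEND = 0x18 }; /* offsets made up */

static uint64_t fake_regs[8]; /* stand-in for the device's MMIO registers */

static void reg_write64(int reg, uint64_t val) { fake_regs[reg / 8] = val; }

/* Publish the exact block size first, then the block address as the
 * doorbell. The device can then DMA precisely block_size bytes with
 * no speculative read-ahead and no terminator scan. */
void tx_send_block(uint64_t block_addr, uint64_t block_size)
{
    reg_write64(TX_BLOCK_SIZE, block_size);
    reg_write64(TX_BLOCK_SEND, block_addr);
}
```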

Eliminate read from transmit path

Suggestion from @blitz via Twitter: Consider avoiding the synchronous read across PCIe from the transmit path (TX_BLOCK_AVAIL) for the sake of efficiency. The read does not have to be made frequently, only perhaps once every thousand packets, but it will be slow.

Could make sense to have transmit/receive/etc state DMA'd onto the host at regular intervals so that it can be pulled from memory rather than fetched synchronously from the device.
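The TXRX_STATUS record defined in the register list above is exactly this kind of DMA'd state. Here is a sketch of a driver consuming it from plain host memory; the field names and the free-space computation are illustrative additions, not part of the spec.

```c
#include <stdint.h>

/* The 16-byte record the device DMA-writes to the TXRX_STATUS
 * address: 64-bit transmit cursor followed by 64-bit receive cursor. */
struct txrx_status {
    volatile uint64_t tx_cursor; /* next offset the device will fetch */
    volatile uint64_t rx_cursor; /* next offset the device will store */
};

/* Free transmit-ring space computed purely from host memory: the
 * driver's own cursor vs. the device cursor kept fresh by DMA, so
 * no synchronous PCIe read is required. One byte is kept unused so
 * a full ring is distinguishable from an empty one (a common
 * ring-buffer convention, assumed here rather than specified). */
uint64_t tx_free_space(const struct txrx_status *st,
                       uint64_t host_cursor, uint64_t ring_size)
{
    return (st->tx_cursor + ring_size - host_cursor - 1) % ring_size;
}
```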

Hardware for 10G EasyNIC

Suppose we wanted to build a 10G EasyNIC. Is there suitable hardware available?

Surprisingly to me, the answer seems to be yes. Here is a suggestion from a twitter conversation with @daveshah1 and others:

EasyNIC 10G would pair a Lattice ECP5-5G FPGA with an external 10G PHY.

The FPGA would use programmable logic to implement the PCIe endpoint, the Ethernet MAC, and the EasyNIC driver interface. The I/O interface from FPGA to PCIe would be the 4 x 5Gbps SERDES on the ECP5-5G. The I/O interface from FPGA to the 10G PHY would be 32 x 311Mbps using ordinary I/O pins on the FPGA (which are known to support twice this bitrate for DDR3 memory.)
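A quick back-of-envelope check of the I/O budget above; the raw figures come from the text, the arithmetic is only a sanity check.

```c
/* Aggregate bandwidth of each interface described above. */
double pcie_serdes_gbps(void) { return 4 * 5.0; }    /* 4 SERDES x 5 Gbps  */
double phy_pins_gbps(void)    { return 32 * 0.311; } /* 32 pins x 311 Mbps */
```

The pin interface lands just under 10 Gbps, which is why 32 lanes at 311 Mbps is the stated target.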

The cost of the FPGA seems to be ~$50 and the PHY ~$20. There are no license fees for the developer tools or hardware features thanks to full ECP5 support in the Yosys open source hardware toolchain.

The NIC might be able to have multiple 10G ports each using separate silicon and connecting to a "bifurcated" PCIe slot.

This is an exciting possibility. Overall this NIC would seem comparable to the Intel 82599. This would make it a practical NIC for many serious applications.

If we decide to move forward with this approach then we could start by developing a 1G NIC using an off-the-shelf ECP5 Versa development board that costs ~$200 and includes PCIe and 10/100/1000 RJ45 connectivity.

Support multiple read/write cores efficiently

Multiple CPU cores should be able to read and write on the EasyNIC at the same time.

Ideally these cores should all share the same transmit/receive interface and efficiently distribute traffic using an algorithm implemented on the CPU. This will likely require some accommodations to be made in the transmit/receive interface to avoid expensive synchronization between the cores and possibly some special rules for memory alignment/arenas/etc as floated on #5.

(The alternative of having the NIC switch packets between multiple transmit/receive interfaces would be nice but it may be prohibitively complex to implement a sufficiently general dispatching mechanism. The trouble with features like RSS is that they only cover special cases of protocols and hashing rules and so on.)

Conserve PCIe bandwidth

PCIe bandwidth is more scarce than it used to be. The host interface has to be designed to use PCIe bandwidth efficiently. This will probably involve organizing DMA into fewer, longer contiguous transfers rather than many smaller scattered ones.

How come PCIe bandwidth is more scarce? It's because Ethernet bandwidth has been increasing in powers of 10 while PCIe bandwidth in powers of 2. The basic numerical relationship has changed with the transition from 10G/40G to 25G/100G:

  Ethernet bandwidth   PCIe bandwidth        PCIe-to-Ethernet ratio
  10G                  16G  (PCIe 2.0 x4)    1.6x
  40G                  64G  (PCIe 3.0 x8)    1.6x
  25G                  32G  (PCIe 3.0 x4)    1.28x
  50G                  64G  (PCIe 3.0 x8)    1.28x
  100G                 128G (PCIe 3.0 x16)   1.28x
  200G                 256G (PCIe 4.0 x16)   1.28x

In the good old days of 10G/40G the PCIe links had 60% extra capacity for overhead such as transferring DMA descriptors and for the PCIe protocols themselves. These modern times of 25G/100G are leaner and only 28% is available. This means that we must treat PCIe bandwidth as a scarce resource because any wastage is likely to actually impact operational performance.
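The ratios in the table can be recomputed directly. The per-lane figures below use the table's nominal numbers (PCIe 2.0 at roughly 4 Gbps per lane, PCIe 3.0 at roughly 8 Gbps per lane).

```c
/* PCIe-to-Ethernet bandwidth ratio, matching the table above. */
double pcie_to_eth_ratio(int lanes, double gbps_per_lane, double eth_gbps)
{
    return lanes * gbps_per_lane / eth_gbps;
}
```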

Hard to keep up with the concepts!

Hi,

I have been following the discussions; they are pretty cool! And the facts in the README.md and discussions look sensible. However, conceptually, I cannot quite digest the why.

For instance, it is stated that the success of RISC-V inspires this project. How? Is there more info we could follow?

I'm not experienced in device drivers, but I have played with the IXGBE driver and netdevice.h in the Linux kernel for a project, and I have reviewed the codebase around that area.

I find this project interesting! As data center Ethernet moves toward 100G and 400G, it becomes intriguing how system software and hardware should behave and manage such rates. I have read a paper that seems to have some synergy with this project (same problem, different perspective, and different solution), though I know that paper is out of the scope of this repo.

I believe that if there were some illustrative documentation, links, or papers to follow and study, then not only me but more people could get involved in such interesting projects.

Thanks
Alireza

Define Transmit interface

Has to be really easy to interface with from a driver, really easy to implement in silicon, and really efficient with PCIe bandwidth.

Avoid streaming non-idempotent register writes?

Suggestion from @blitz via Twitter: Avoid non-idempotent streaming writes to the same register, such as the way the host writes multiple addresses back-to-back into TX_BLOCK_SEND and requires the device to process all of them in FIFO order. Awkward to handle this on the hardware side? What is a better approach?

[Mildly interesting] NIC speed vs. driver lines of code

[Figure: scatterplot of driver lines of code vs. maximum supported NIC speed]

The plot shows Ethernet drivers in Linux 4.19 and DPDK (I forget the exact version) by the maximum speed they support vs. their lines of code. There is a linear correlation (R^2 = 0.37) between network speed and driver complexity, so it is only going to get worse for now.
