
liburing
--------

This is the io_uring library, liburing. liburing provides helpers to set up
and tear down io_uring instances, and also a simplified interface for
applications that don't need (or want) to deal with the full kernel
side implementation.

For more info on io_uring, please see:

https://kernel.dk/io_uring.pdf

Subscribe to io-uring@vger.kernel.org for io_uring related discussions
and development for both kernel and userspace. The list is archived here:

https://lore.kernel.org/io-uring/


kernel version dependency
--------------------------

liburing itself is not tied to any specific kernel release, and hence it's
possible to use the newest liburing release even on older kernels (and vice
versa). Newer features may only be available on more recent kernels,
obviously.


ulimit settings
---------------

io_uring accounts the memory it needs under the RLIMIT_MEMLOCK rlimit, which
can be quite low on some setups (64K). The default is usually enough for
most use cases, but bigger rings or things like registered buffers deplete
it quickly. root isn't under this restriction, but regular users are. Going
into detail on how to bump the limit on various systems is beyond the scope
of this little blurb, but check /etc/security/limits.conf for user specific
settings, or /etc/systemd/user.conf and /etc/systemd/system.conf for systemd
setups. This affects kernels 5.11 and earlier; newer kernels are less
dependent on RLIMIT_MEMLOCK, as it is only used for registering buffers.


Regression tests
-----------------

The bulk of liburing is actually regression/unit tests for both liburing and
the kernel io_uring support. Please note that this suite isn't expected to
pass on older kernels, and may even crash or hang older kernels!


Building liburing
-----------------

    #
    # Prepare build config (optional).
    #
    #  --cc  specifies the C   compiler.
    #  --cxx specifies the C++ compiler.
    #
    ./configure --cc=gcc --cxx=g++;

    #
    # Build liburing.
    #
    make -j$(nproc);

    #
    # Install liburing (headers, shared/static libs, and manpage).
    #
    sudo make install;

See './configure --help' for more information about build config options.


FFI support
-----------

By default, the build results in 4 lib files:

    2 shared libs:

        liburing.so
        liburing-ffi.so

    2 static libs:

        liburing.a
        liburing-ffi.a

liburing's main public interface lives in liburing.h as 'static inline'
functions. Languages and applications that can't use 'static inline'
functions, or that wish to consume liburing purely as a binary dependency,
should link against the FFI variants instead; liburing-ffi contains
non-inline definitions for every 'static inline' function.


License
-------

All software contained within this repo is dual licensed LGPL and MIT, see
COPYING and LICENSE, except for a header coming from the kernel which is
dual licensed GPL with a Linux-syscall-note exception and MIT, see
COPYING.GPL and <https://spdx.org/licenses/Linux-syscall-note.html>.

Jens Axboe 2022-05-19


liburing's Issues

io_uring_cqe.res

I think there is something wrong with io_uring_cqe.res. When I call something like:

read-1

cqe = io_uring_cqe()
... io_uring_prep_readv
cqe.res # will output say "5"

read-2

cqe = io_uring_cqe()
... io_uring_prep_readv
cqe.res # will output the same value as read-1, "5"

while read-2's content length/buffer size is totally different!

Not sure what's going on.

Feature request: Support fallocate()

It's not clear if/when fallocate() blocks (with a notable exception for certain network file systems), but regardless it'd be nice to include it in the io_uring framework, since we can link fsyncs to commit file metadata changes.

Question: Proper usage of io_uring_register_buffers

The man pages seem to indicate that it is fine to register additional/different buffers during the lifetime of a ring. It would help to add to the documentation a statement about when a call to unregister is considered safe.
It comes down to the following question: "Is it safe to unregister and re-register buffers, while an operation like IORING_OP_READ_FIXED is submitted but not yet completed?" Or in other words: "Would one have to wait until all scheduled operations on registered buffers are completed before unregistering?" I assume the latter but it makes it non-trivial to register/unregister buffers at runtime based on the demand of the application. In this case, one would have to drain (IOSQE_IO_DRAIN) the ring before registering additional buffers which will likely cause a hiccup in throughput.

Feature request: Please rethink supporting seeking operations

It's a fact that users don't want to maintain an offset themselves. Seeking operations are still widely used but there's no way to do it using io_uring. (lseek+read+lseek is not atomic)

I think readv/writev without offsets are still reasonable for asio: operations on different fds can still run in parallel, as shown in ucontext-cp.

libuv has to punt these operations to the threadpool currently. Let's support it natively.

Discussion: write/send number of bytes based on previous read/recv result for IOSQE_IO_LINK

One use case for IOSQE_IO_LINK is zero-copy IO, but it's hard to determine how many bytes were actually read.

For example, an echo server. It's just ACCEPT -> RECV -> SEND -> CLOSE, but it's hard or impossible to do in a zero-copy way. The problems are:

  1. The fd used by RECV/SEND/CLOSE is generated by ACCEPT; there's no way to use it in an IOSQE_IO_LINK chain
  2. The number of bytes received from the client is known only after RECV completes. You have to wait for RECV's completion to know how many bytes need to be sent.

For 2, man 2 read says that

It is not an error if this number is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now (maybe because we were close to end-of-file, or because we are reading from a pipe, or from a terminal), or because read() was interrupted by a signal.

So even a simple READ -> WRITE link chain may not always be reliable.

I suggest adding a flag called IOSQE_IO_USE_PREV_RES (the name is not decided), which works only with IORING_OP_{WRITE,SEND} and must be used together with IOSQE_IO_LINK. It indicates that the current operation's buffer size is set by the previous operation's return code. If the previous return code is <= 0, the operation should generate an error.

What do you think?

Issue: process using io_uring hangs forever for unknown reason

A program written by me became a zombie process for some reason. I didn't fork any other processes, nor do anything special, just normal stuff.

I was testing IOSQE_IO_LINK, FIXED_FILES and FIXED_BUFFERS, if that helps.

[screenshot]

It can't be reproduced consistently, but it has happened several times. I couldn't kill it. When I was rebooting the system, I got:

[screenshot]

$ uname -a                                                           23:46:16
Linux carter-virtual-machine 5.5.0-999-generic #202002082109 SMP Sun Feb 9 02:13:41 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

torvalds/linux@d4f309c

Question about LIBURING_UDATA_TIMEOUT filtering

Currently io_uring_peek_cqe filters all cqes with user_data set to LIBURING_UDATA_TIMEOUT, while io_uring_peek_batch_cqe does not. Is this the desired behavior or not?

Sample code:

#include "liburing.h"
#include <stdio.h>
#include <errno.h>

int main(int argc, char const *argv[])
{
    int ret;
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;

    ret = io_uring_queue_init(32, &ring, 0);

    if (ret)
    {
        fprintf(stderr, "queue init failed: %d\n", ret);
        return ret;
    }

    sqe = io_uring_get_sqe(&ring);
    if (!sqe)
    {
        fprintf(stderr, "sqe get failed\n");
        return 1;
    }

    // this one gets filtered
    io_uring_prep_nop(sqe);
    io_uring_sqe_set_data(sqe, (void *)LIBURING_UDATA_TIMEOUT);

    ret = io_uring_submit_and_wait(&ring, 1);

    if (ret != 1)
    {
        fprintf(stderr, "submit failed: %d\n", ret);
        return 1;
    }

    ret = io_uring_peek_cqe(&ring, &cqe);

    if (ret != -EAGAIN)
    {
        fprintf(stderr, "peek failed: %d\n", ret);
        return ret;
    }

    sqe = io_uring_get_sqe(&ring);
    if (!sqe)
    {
        fprintf(stderr, "sqe get failed\n");
        return 1;
    }

    // this one is not filtered
    io_uring_prep_nop(sqe);
    io_uring_sqe_set_data(sqe, (void *)LIBURING_UDATA_TIMEOUT);

    ret = io_uring_submit_and_wait(&ring, 1);

    if (ret != 1)
    {
        fprintf(stderr, "submit failed: %d\n", ret);
        return ret;
    }

    ret = io_uring_peek_batch_cqe(&ring, &cqe, 1);
    if (ret != 1)
    {
        fprintf(stderr, "peek batch failed, expected 1, got: %d\n", ret);
        return ret;
    }

    if (cqe->user_data != LIBURING_UDATA_TIMEOUT)
    {
        fprintf(stderr, "LIBURING_UDATA_TIMEOUT expected");
        return 1;
    }
    return 0;
}

io_uring randomly hangs forever under high IO load

I am testing the following code, which spawns 5 threads and drives IOs on each of them. It randomly hangs both with and without the IORING_SETUP_SQPOLL flag.

#include <errno.h>                                                                                                                                                                                 
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdbool.h>

#include "liburing.h"

#define BS 4096
#define QD 32
#define MAX_OBJECTS 5

static struct io_uring ring[MAX_OBJECTS];
static int dev_fd;
static int ios;
static bool sqpoll;
static struct iovec iov;

static void *setup_iov_base(size_t size)
{
        void *buf;
        int fd;

        if (posix_memalign(&buf, BS, size) != 0) {
                printf("mem aligned failed\n");
                return NULL;
        }

        fd = open("/dev/urandom", O_RDONLY);
        if (fd < 0) {
                printf("Failed to open urandom. rc=%d\n", fd);
                return NULL;
        }

        read(fd, buf, size);
        close(fd);

        return buf;
}

static int init(void)
{
        struct io_uring_params p = { 0 };
        int i, rc;

        if (sqpoll) {
                p.flags = IORING_SETUP_SQPOLL;
                printf("Initializing liburing with SQPOLL flag\n");
        }

        dev_fd = open("/dev/nvme1n1", O_RDWR | O_DIRECT);
        if (dev_fd < 0) {
                printf("Failed to open nvme device. rc=%d\n", dev_fd);
                return dev_fd;
        }

        for (i = 0; i < MAX_OBJECTS; i++) {
                rc = io_uring_queue_init_params(QD, &ring[i], &p);
                if (rc != 0) {
                        printf("queue_init failed. rc=%d\n", rc);
                        return rc;
                }

                if (sqpoll) {
                        rc = io_uring_register_files(&ring[i], &dev_fd, 1);
                        if (rc < 0) {
                                printf("Failed to register files. rc=%d\n", rc);
                                return rc;
                        }
                }
        }

        iov.iov_base = setup_iov_base(BS);
        iov.iov_len = BS;

        return 0;
}

static inline void submit_to_kernel(char *failure_message, int thread_id)
{
        int rc;

        rc = io_uring_submit(&ring[thread_id]);
        if (rc < 0) {
                printf("%s. rc=%d\n", failure_message, rc);
        }
}

static struct io_uring_sqe *get_sqe(int tid, int *yield)
{
        struct io_uring_sqe *sqe;

        while ((sqe = io_uring_get_sqe(&ring[tid])) == NULL) {
                submit_to_kernel("Failure to wake napping thread", tid);
                *yield = *yield + 1;
                pthread_yield();
        }

        return sqe;
}

static void *submit_io(void *input)
{
        off_t offset = 0;
        int thread_id = *((int *)input);
        int total_ios = ios;
        int yield = 0;
                                                                                                                                                                                                    
        while (total_ios != 0) {
                struct io_uring_sqe *sqe = get_sqe(thread_id, &yield);

                if (sqpoll) {
                        io_uring_prep_writev(sqe, 0, &iov, 1, offset);
                        sqe->flags |= IOSQE_FIXED_FILE;
                } else {
                        io_uring_prep_writev(sqe, dev_fd, &iov, 1, offset);
                }

                sqe->user_data = offset;

                total_ios--;
                if (total_ios % QD == 0) {
                        submit_to_kernel("Failed to submit new IO", thread_id);
                }

                offset += BS;
        }

        printf("[thread_id=%d] Submission complete. yield=%d\n", thread_id, yield);

        return NULL;
}

static void *reap_io_completions(void *input)
{
        int thread_id = *((int *)input);
        int failed_ios = 0, rc;
        int total_ios = ios;

        while (total_ios != 0) {
                struct io_uring_cqe *cqe = NULL;

                rc = io_uring_wait_cqe(&ring[thread_id], &cqe);
                if (rc < 0 || cqe->res != BS) {
                        printf("thread_id=%d rc=%d cqe->res=%d offset=%llu\n", thread_id, rc, cqe->res, cqe->user_data);
                        failed_ios++;
                }

                total_ios--;
                io_uring_cqe_seen(&ring[thread_id], cqe);
        }

        printf("[thread_id=%d] Failed IO count=%d\n", thread_id, failed_ios);

        return NULL;
}

int main(int argc, char *argv[])
{
        pthread_t submit[MAX_OBJECTS], complete[MAX_OBJECTS];
        int t_ids[MAX_OBJECTS];
        int i, rc;

        if (argc != 3) {
                printf("Expected two arguments\n");
                return -EINVAL;
        }
        ios = atoi(argv[1]);
        sqpoll = atoi(argv[2]) == 1;

        rc = init();
        if (rc != 0) {
                return rc;
        }

        for (i = 0; i < MAX_OBJECTS; i++) {
                t_ids[i] = i;

                rc = pthread_create(&submit[i], NULL, submit_io, &t_ids[i]);
                if (rc < 0) {
                        printf("Failed to create submit thread. rc=%d\n", rc);
                        return rc;
                }

                rc = pthread_create(&complete[i], NULL, reap_io_completions, &t_ids[i]);
                if (rc < 0) {
                        printf("Failed to create complete thread. rc=%d\n", rc);
                        return rc;
                }
        }

        for (i = 0; i < MAX_OBJECTS; i++) {
                pthread_join(submit[i], NULL);
                printf("Reaped submit thread_id=%d\n", i);

                pthread_join(complete[i], NULL);
                printf("Reaped complete thread_id=%d\n", i);

                io_uring_queue_exit(&ring[i]);
        }

        close(dev_fd);

        return 0;
}

The following are example outputs:

Success without SQPOLL
[root@ip-10-0-58-7 liburing]# ./examples/iouring-object 500 0                                            
[thread_id=0] Submission complete. yield=0
Reaped submit thread_id=0
[thread_id=1] Submission complete. yield=0
[thread_id=3] Submission complete. yield=0
[thread_id=2] Submission complete. yield=0
[thread_id=4] Submission complete. yield=0
[thread_id=0] Failed IO count=0
Reaped complete thread_id=0
[thread_id=1] Failed IO count=0
[thread_id=3] Failed IO count=0
[thread_id=2] Failed IO count=0
[thread_id=4] Failed IO count=0
Reaped submit thread_id=1
Reaped complete thread_id=1
Reaped submit thread_id=2
Reaped complete thread_id=2
Reaped submit thread_id=3
Reaped complete thread_id=3
Reaped submit thread_id=4
Reaped complete thread_id=4

Success with SQPOLL
[root@ip-10-0-58-7 liburing]# ./examples/iouring-object 500 1
Initializing liburing with SQPOLL flag
[thread_id=1] Submission complete. yield=883
[thread_id=0] Submission complete. yield=989
Reaped submit thread_id=0
[thread_id=2] Submission complete. yield=1070
[thread_id=3] Submission complete. yield=963
[thread_id=4] Submission complete. yield=966
[thread_id=1] Failed IO count=0
[thread_id=0] Failed IO count=0
Reaped complete thread_id=0
[thread_id=2] Failed IO count=0
[thread_id=3] Failed IO count=0
[thread_id=4] Failed IO count=0
Reaped submit thread_id=1
Reaped complete thread_id=1
Reaped submit thread_id=2
Reaped complete thread_id=2
Reaped submit thread_id=3
Reaped complete thread_id=3
Reaped submit thread_id=4
Reaped complete thread_id=4

Failure without SQPOLL
[root@ip-10-0-58-7 liburing]# ./examples/iouring-object 2000 0
[thread_id=0] Submission complete. yield=0
[thread_id=1] Submission complete. yield=0
Reaped submit thread_id=0
[thread_id=3] Submission complete. yield=0
[thread_id=2] Submission complete. yield=0
[thread_id=4] Submission complete. yield=0
[thread_id=1] Failed IO count=0
[thread_id=2] Failed IO count=0
[thread_id=4] Failed IO count=0
^C

Failure with SQPOLL
[root@ip-10-0-58-7 liburing]# ./examples/iouring-object 5000 1
Initializing liburing with SQPOLL flag
[thread_id=4] Submission complete. yield=32130
[thread_id=1] Submission complete. yield=65518
[thread_id=3] Submission complete. yield=67878
[thread_id=0] Submission complete. yield=72433
Reaped submit thread_id=0
[thread_id=2] Submission complete. yield=73237
[thread_id=1] Failed IO count=0
[thread_id=3] Failed IO count=0
[thread_id=0] Failed IO count=0
Reaped complete thread_id=0
[thread_id=2] Failed IO count=0
Reaped submit thread_id=1
Reaped complete thread_id=1
Reaped submit thread_id=2
Reaped complete thread_id=2
Reaped submit thread_id=3
Reaped complete thread_id=3
Reaped submit thread_id=4
^C

liburing commit - a68caac
Linux kernel has been built from https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.4.1.tar.xz

Feature request: threading support

Threading support can be very useful in async programming, for example thread joining and condvar waiting.

Futex is a good start IMO.

Possible erroneous behavior when reusing SQEs

I'm issuing an accept() SQE, subsequently followed by a connect() SQE. The connect result is success; however, the accept() returns with CQE status ENOTCONN. Ignoring the error, the connected socket is fine and can issue I/O.

I presume this has something to do with the asynchronous connect case. Running with linux kernel at e31736d9fae841e8a1612f263136454af10f476a (12/14).

Question about versioning

Thanks for your work on io_uring, it's a really stellar interface! I'm working on wrapping liburing in a Rust library to make it accessible from Rust (as well as higher level memory-safe integrations into our async/.await ecosystem, which haven't borne fruit yet).

First, I just want to confirm this is the best place for you to receive questions & pull requests. Let me know if not.

My main question: what is the backwards compatibility story for liburing right now? In general, would you say you will not remove or break APIs exposed by liburing (except, obviously, internal __ functions)? I noticed that you recently removed the syscall helpers from liburing, but I think that was a special case, because you expect them to be upstreamed to glibc.

I'm asking to determine my own versioning for my Rust wrappers. Most Rust users use cargo to perform version resolution for them, and cargo makes strong assumptions about backwards compatibility between "semver compatible" versions, so I just need to figure out if I should prepare for possible breaking changes between updates to liburing.

IORING_OP_READ for eventfd

I've been trying and failing to read from an fd opened with eventfd() through IORING_OP_READ:

int fd = eventfd(0, 0);
io_uring_prep_read(sqe, fd, &event, sizeof(eventfd_t), 0);

This fails consistently with EINVAL.
Polling with io_uring_prep_poll_add(sqe, fd, 0); and then reading with read(fd, &event, sizeof(eventfd_t)) works.

Am I missing something, or is reading from an eventfd directly through io_uring not supported?

Thanks

Possible bug with IOSQE_IO_DRAIN

Hi,
I'm trying to test various aspects of io_uring and came to fsync test in this lib.
I've noticed this line: https://github.com/axboe/liburing/blob/master/test/fsync.c#L117
and added just a printf to let me know if my kernel is ok with it or not.

And unfortunately, it doesn't work.
I've tested it on kernels 5.2.x (where this flag was introduced) and kernel 5.3.x.
It just returns that error and I don't understand why.

When I added IOSQE_IO_LINK to the submission flags of the flush operation, it started to run ok.

But it seems to be against what is documented and what is actually tested with the IOSQE_IO_DRAIN.

I've also tried to search the internet to see if someone has already faced the same issue, and found only this possibly relevant post: https://www.mail-archive.com/[email protected]/msg39033.html, but without any followup.

Is this an issue or some misunderstanding? Thanks!

PS: This line https://github.com/axboe/liburing/blob/master/test/fsync.c#L101 should probably be if (ret == -EINVAL) but is unrelated to this problem.

Support asynchronous-but-blocking socket reads

I have an open socket from which I'm reading data, with the following behavior:

(1) If the socket is blocking, then preparing a read [io_uring_prep_readv] and submitting it [io_uring_submit] causes the submit to block until data can be read from the socket.

(2) If the socket is non-blocking, then doing (1) causes EAGAIN to be returned on the CQE, unless the socket has data available.

(3) If the socket is non-blocking, then polling the socket for input [io_uring_prep_poll_add + POLLIN], flagging it with IOSQE_IO_LINK, followed by a consecutive read SQE causes the error EINVAL to be returned on the poll CQE.

Ideally (3) would work and perform the read when the socket received input. In order to get it to work, I have to split up the poll and read and only submit the latter after I receive the former.

Likewise when sending data on a socket. In the rare occurrence where the output buffer is full, instead of registering a POLLOUT and retrying the write, it'd be nice to send the data and only have to worry about my total outstanding operations.

Perhaps I'm missing something, but can this be supported? Thanks!

Issue: IORING_OP_RECV returns -EFAULT constantly

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>

#include <liburing.h>

int main() {
    char buffer[1024];
    struct io_uring ring;
    struct sockaddr_in saddr;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int sockfd = 0, clientfd = 0, ret;

    io_uring_queue_init(32, &ring, 0);

    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0) {
        perror("socket");
        goto err;
    }

    saddr = (struct sockaddr_in) {
        .sin_family = AF_INET,
        .sin_addr = {
            .s_addr = htonl(INADDR_ANY),
        },
        .sin_port = htons(12345),
    };

    ret = bind(sockfd, (struct sockaddr *)&saddr, sizeof(saddr));
    if (ret < 0) {
        perror("bind");
        goto err;
    }

    ret = listen(sockfd, 32);
    if (ret < 0) {
        perror("listen");
        goto err;
    }

    clientfd = accept(sockfd, NULL, NULL);
    if (clientfd < 0) {
        perror("accept");
        goto err;
    }

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, clientfd, buffer, sizeof(buffer), 0);

    ret = io_uring_submit_and_wait(&ring, 1);
    if (ret <= 0) {
        perror("io_uring_submit_and_wait");
        goto err;
    }

    io_uring_peek_cqe(&ring, &cqe);

    if (cqe->res < 0) {
        printf("recv failed: %d\n", cqe->res);
        goto err;
    }

err:
    io_uring_queue_exit(&ring);
    close(clientfd);
    close(sockfd);
}
$ gcc recv.c -o recv -luring
$ ./recv # On another terminal: curl -v localhost:12345
recv failed: -14
$ uname -a # https://kernel.ubuntu.com/~kernel-ppa/mainline/daily/2020-01-31/
Linux carter-virtual-machine 5.5.0-999-generic #202001302109 SMP Fri Jan 31 02:15:05 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Issue: IORING_OP_READV sometimes returns 0 or -EFAULT after dropping caches

This issue only happens when you submit multiple read requests.

#include <liburing.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

char str1[32768];
char str2[32768];

struct io_uring ring;

void prep(int fd, char *str) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    struct iovec iov = {
        .iov_base = str,
        .iov_len = sizeof(str1),
    };
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);
    sqe->user_data = fd;
    io_uring_submit(&ring);
    printf("SUBMIT: %d\n", fd);
}

void wait() {
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);
    printf("FINISH: %d with res %d\n", (int)cqe->user_data, cqe->res);
}

int main() {
    io_uring_queue_init(32, &ring, 0);
    int fd1 = open("/path/to/large/file1", O_RDONLY);
    int fd2 = open("/path/to/large/file2", O_RDONLY);

    prep(fd1, str1);
    prep(fd2, str2);
    wait();
    wait();

    close(fd2);
    close(fd1);
    io_uring_queue_exit(&ring);
}

Before executing the program, run sync && echo 3 > /proc/sys/vm/drop_caches && swapoff -a && swapon -a

You will get -EFAULT when debugging the program using GDB

(gdb) run
Starting program: /root/test/./test

SUBMIT: 8
SUBMIT: 9
FINISH: 9 with res -14
FINISH: 8 with res 32768
[Inferior 1 (process 16788) exited normally]
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64

If you run it directly, the program will crash with segfault.

$ ./test
SUBMIT: 4
SUBMIT: 5
FINISH: 4 with res 32768
FINISH: 5 with res 3568
[1]    16893 segmentation fault (core dumped)  ./test

If you run the program again without dropping caches, it will work as expected

Linux localhost 5.4.0-1.el7.elrepo.x86_64 #1 SMP Mon Nov 25 09:18:09 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

Original post: hakasenyang/openssl-patch#22 (comment)

EDIT: verified on 5.5rc too

Issue: io_uring_prep_connect always returns -EINPROGRESS

// connect.c
#include <liburing.h>
#include <stdio.h>
#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    struct addrinfo hints = {
        .ai_family = AF_UNSPEC,
        .ai_socktype = SOCK_STREAM,
    }, *addr;

    if (getaddrinfo("github.com", "http", &hints, &addr) < 0) {
        return 1;
    }
    int clientfd = socket(addr->ai_family, addr->ai_socktype, addr->ai_protocol);
    if (clientfd < 0) return 2;

#ifndef USE_PLAIN_CONNECT
    struct io_uring ring;
    io_uring_queue_init(32, &ring, 0);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_connect(sqe, clientfd, addr->ai_addr, addr->ai_addrlen);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);
    int ret = cqe->res;

    io_uring_queue_exit(&ring);
#else
    int ret = connect(clientfd, addr->ai_addr, addr->ai_addrlen);
#endif

    printf("%d\n", ret);
    close(clientfd);
    return 0;
}
$ clang connect.c -luring -o connect && ./connect
-115
$ clang connect.c -luring -o connect -DUSE_PLAIN_CONNECT && ./connect
0
$ uname -a
Linux carter-virtual-machine 5.4.0-999-generic #201911282213 SMP Fri Nov 29 03:17:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Issue: IORING_OP_POLL_ADD with signalfd

#include <unistd.h>
#include <sys/signalfd.h>
#include <sys/poll.h>

#include <liburing.h>

int main() {
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);

    sigprocmask(SIG_BLOCK, &mask, NULL);
    int sfd = signalfd(-1, &mask, SFD_NONBLOCK);

    struct io_uring ring;
    io_uring_queue_init(32, &ring, 0);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_poll_add(sqe, sfd, POLLIN);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);

    close(sfd);
    return 0;
}

Ctrl+C should terminate the program but it doesn't. Similar code works for epoll: https://gist.github.com/CarterLi/b8db2fcfea689b96eeae382c38130afb

Linux Ubuntu 5.3.0-10-generic #11-Ubuntu SMP Mon Sep 9 15:12:17 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Weird readv behavior on 5.5 kernel

On a 5.5 kernel, if the file size is less than the iovec size, cqe->res will be equal to 0. On a 5.4 kernel, cqe->res will contain the correct number of bytes read.

Sample code:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/poll.h>

#include "liburing.h"

#define BUF_SIZE 4096
#define FILE_SIZE 1024

static int create_file(const char *file)
{
	ssize_t ret;
	char *buf;
	int fd;

	buf = malloc(FILE_SIZE);
	memset(buf, 0xaa, FILE_SIZE);

	fd = open(file, O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open file");
		return 1;
	}
	ret = write(fd, buf, FILE_SIZE);
	close(fd);
	return ret != FILE_SIZE;
}

int main(int argc, char* argv[]) {
	int ret, fd;
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec vec;

	vec.iov_base = malloc(BUF_SIZE);
	vec.iov_len = BUF_SIZE;

	if (create_file(".basic-r")) {
		fprintf(stderr, "file creation failed\n");
		return 1;
	}

	fd = open(".basic-r", O_RDONLY);
	if (fd < 0) {
		perror("file open");
		return 1;
	}

	ret = io_uring_queue_init(32, &ring, 0);
	if (ret)
		return ret;

	sqe = io_uring_get_sqe(&ring);
	if (!sqe) {
		fprintf(stderr, "sqe get failed\n");
		return 1;
	}

	io_uring_prep_readv(sqe, fd, &vec, 1, 0);

	ret = io_uring_submit(&ring);
	if (ret != 1)
		return 1;

	ret = io_uring_wait_cqes(&ring, &cqe, 1, 0, 0);
	if (ret)
		return 1;

	fprintf(stderr, "cqe res %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}

Issue: IORING_OP_RECVMSG sometimes returns -EFAULT (regression?)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>

#include <liburing.h>


int main() {
    char buffer[1024];
    struct io_uring ring;
    struct iovec iov;
    struct sockaddr_in saddr;
    struct msghdr msg;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int sockfd = 0, clientfd = 0, ret;

    io_uring_queue_init(32, &ring, 0);

    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0) {
        perror("socket");
        goto err;
    }

    saddr = (struct sockaddr_in) {
        .sin_family = AF_INET,
        .sin_addr = {
            .s_addr = htonl(INADDR_ANY),
        },
        .sin_port = htons(12345),
    };

    ret = bind(sockfd, (struct sockaddr *)&saddr, sizeof(saddr));
    if (ret < 0) {
        perror("bind");
        goto err;
    }

    ret = listen(sockfd, 32);
    if (ret < 0) {
        perror("listen");
        goto err;
    }

    clientfd = accept(sockfd, NULL, NULL);
    if (clientfd < 0) {
        perror("accept");
        goto err;
    }

    iov = (struct iovec) {
        .iov_base = buffer,
        .iov_len = sizeof(buffer),
    };

    msg = (struct msghdr) {
        .msg_namelen = sizeof(struct sockaddr_in),
        .msg_iov = &iov,
        .msg_iovlen = 1,
    };

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recvmsg(sqe, clientfd, &msg, 0);

    ret = io_uring_submit_and_wait(&ring, 1);
    if (ret <= 0) {
        perror("io_uring_submit_and_wait");
        goto err;
    }

    io_uring_peek_cqe(&ring, &cqe);

    if (cqe->res < 0) {
        printf("recvmsg failed: %d\n", cqe->res);
        goto err;
    }

err:
    io_uring_queue_exit(&ring);
    close(clientfd);
    close(sockfd);
}
$ clang -g -luring -o test test.c
$ ./test # On another terminal: curl -v localhost:12345
recvmsg failed: -14 # Not every time
$ uname -a # https://kernel.ubuntu.com/~kernel-ppa/mainline/daily/2019-12-15/
Linux carter-virtual-machine 5.5.0-999-generic #201912142104 SMP Sun Dec 15 02:07:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

I can't reproduce it on Linux 5.4; it may be related to https://lore.kernel.org/io-uring/[email protected]/T/#m919b41ecbf5049c15df15e8cbf2ff982acc37cc9

Issue: IORING_OP_TIMEOUT with IOSQE_IO_LINK always cancels the next operation, which makes it unusable

IORING_OP_TIMEOUT returns -ETIME when it expires, which is treated as an error and breaks the entire link. As a result, operations after an IORING_OP_TIMEOUT with IOSQE_IO_LINK are always canceled.

#include <unistd.h>
#include <liburing.h>

int main() {
	struct io_uring ring;
	io_uring_queue_init(8, &ring, 0);

	struct io_uring_sqe *sqe1 = io_uring_get_sqe(&ring);
	struct __kernel_timespec ts = {
		.tv_sec = 1,
		.tv_nsec = 0,
	};
	io_uring_prep_timeout(sqe1, &ts, 0, 0);
	io_uring_sqe_set_flags(sqe1, IOSQE_IO_LINK);

	struct io_uring_sqe *sqe2 = io_uring_get_sqe(&ring);
	struct iovec iov = {
		.iov_base = "OK\n",
		.iov_len = 3,
	};
	io_uring_prep_writev(sqe2, STDERR_FILENO, &iov, 1, 0);
	io_uring_submit_and_wait(&ring, 2);

	io_uring_queue_exit(&ring);
}

Expected: waits 1s, then prints "OK"
Actual: nothing is printed

Discussion: performance about reading/writing a socket

When benchmarking an echo server written with io_uring, I found that adding a poll_add sqe before readv/recvmsg could yield about a 30% performance boost:

https://github.com/CarterLi/io_uring-echo-server/blob/switch/io_uring_echo_server.c#L14

131729 requests/sec vs. 98694 requests/sec using rust_echo_bench.

That was unexpected. AFAIK readv/recvmsg is an async operation itself; adding a poll_add sqe shouldn't help, and should only add an extra context switch (because it wakes up io_uring_enter).

After some investigation, I found that the program without poll_add creates lots of kernel threads called io_wqe_worker, but the program with poll_add doesn't.

(screenshot: io_wqe_worker kernel threads in the process list)

I don't know how poll_add works, but it seems that poll_add has much lower cost than an async read. Is this expected? And maybe a silly question: could we implement async read as poll_add plus a nonblocking read?
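The poll-then-read idea at the end can be sketched with plain poll(2) and a nonblocking fd. This is only an illustration of the two-step pattern the poll_add + read chain emulates, not io_uring code:

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Wait until fd is readable, then do a nonblocking read.
 * Returns bytes read, 0 on poll timeout, or -1 on error.
 * fd must have O_NONBLOCK set so the read itself never sleeps. */
static ssize_t poll_then_read(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ready = poll(&pfd, 1, timeout_ms);

	if (ready <= 0)
		return ready;	/* 0 = timed out, -1 = poll error */
	return read(fd, buf, len);
}
```

With data already pending on a pipe, the poll returns immediately and the read picks the data up; on an empty pipe with a zero timeout, it returns 0 without blocking.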

Consider supporting unvectored read/write?

io_uring currently only supports vectored reads and writes (except for the _fixed operations). While vectored reads and writes are in theory a superset of single reads and writes, the required indirection of the array of iovecs presents some problems.

In particular, I'm interested in creating a memory safe abstraction of io_uring's completion-based API in Rust. Realistically, the best way to do this is for the abstraction to have logical ownership of the buffers until the IO is complete. The naive solution would be to just always allocate an intermediate buffer, which would mean an extra allocation for every read or write operation. There are better solutions which avoid the allocation, but they can be tricky to implement.

It would be easier to create a safe API for unvectored read/write (the common case) if it were supported directly by the io_uring interface. Then the abstraction would only need to manage the lifetime of the actual buffer and not the indirection array as well.
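To illustrate the indirection problem, one workaround (a sketch; the struct and helper names are invented here) is to bundle the buffer and its iovec in a single allocation, so one pointer keeps both alive until the CQE arrives:

```c
#include <stdlib.h>
#include <sys/uio.h>

/* A single read/write currently has to go through an iovec whose
 * storage must also outlive the operation. Bundling buffer and iovec
 * in one allocation means one pointer owns everything until the
 * completion is reaped. */
struct owned_buf {
	struct iovec vec;	/* points into data[] below */
	char data[];		/* the actual I/O buffer */
};

static struct owned_buf *owned_buf_new(size_t len)
{
	struct owned_buf *b = malloc(sizeof(*b) + len);

	if (!b)
		return NULL;
	b->vec.iov_base = b->data;
	b->vec.iov_len = len;
	return b;
}
```

You would pass &b->vec to io_uring_prep_readv and free(b) once the CQE for the operation is seen; a native unvectored opcode would make the wrapper unnecessary.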

Feature request: Allow sqe flags like IOSQE_IO_LINK for IORING_OP_TIMEOUT with off == 0

Currently we forbid flags for IORING_OP_TIMEOUT.

https://github.com/torvalds/linux/blob/63de37476ebd1e9bab6a9e17186dc5aa1da9ea99/fs/io_uring.c#L2456

I think that's reasonable for io_uring_wait_cqe_timeout. But a pure timeout (i.e. REQ_F_TIMEOUT_NOSEQ) should behave like other operations and allow the common sqe flags.

This is a valid usage:

  1. readv
  2. timeout(1s)
  3. writev

And this should be valid too:

  1. timeout(1s)
  2. timeout(1s)
  3. timeout(1s)

Issue: io_uring_prep_timeout seems to use the previously submitted timespec as its timespec

// test.c
#include <liburing.h>
#include <stdio.h>
#include <time.h>

int main() {
    struct io_uring ring;
    io_uring_queue_init(32, &ring, 0);
    printf("0: %ld\n", time(NULL));

{
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_nop(sqe);
    io_uring_submit(&ring);

    struct __kernel_timespec ts = {
        .tv_sec = 10,
        .tv_nsec = 0,
    };
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe_timeout(&ring, &cqe, &ts);
    io_uring_cqe_seen(&ring, cqe);
    printf("1: %ld\n", time(NULL));
}
{
    struct __kernel_timespec ts = {
        .tv_sec = 1,
        .tv_nsec = 0,
    };
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_timeout(sqe, &ts, 0, 0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);
    printf("2: %ld\n", time(NULL));
}

    io_uring_queue_exit(&ring);
    return 0;
}

Actual: The last io_uring_prep_timeout waits for 10s

$ clang test.c -luring -o test
$ ./test 
0: 1575128130
1: 1575128130
2: 1575128140

Expected: it should only wait for 1s

Linux carter-virtual-machine 5.4.0-999-generic #201911282213 SMP Fri Nov 29 03:17:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

https://kernel.ubuntu.com/~kernel-ppa/mainline/daily/2019-11-29/

user access

Is there a way to manage user access/permission?

For example, say I use liburing in a web server and run it as root, perhaps because I'm using IORING_SETUP_SQPOLL. I wouldn't want my web server to run everything with root privileges; some operations (read/write, ...) may need to run as other users.

Maybe this is something we could set in the sqe:

sqe = io_uring_get_sqe(ring)
sqe.setuid = 123

On error it would raise a permission-denied error.

Calling `io_uring_setup` from multiple threads

Hi, I've probably found another problem, triggered when my tests run in parallel.
Should it be OK to set up a unique io_uring independently from multiple threads?

I've created a simple test to reproduce it (at least on my Ryzen 7 3700X with Fedora 31, kernel 5.3.12).

Basically, it fails during the io_uring_setup call with ENOMEM.
I've tried to follow this guide to figure something out, but I'm not into kernel dev, so this all seems like too much voodoo for me ;-)

Anyway, here's the trace output if it helps somehow:

  1)               |  __x64_sys_io_uring_setup() {
  1)               |    io_uring_setup() {
  1)               |      capable() {
  1)               |        ns_capable_common() {
  1)  <...>-633600  =>  <...>-633604 
  1)               |          security_capable() {
  1)  <...>-633604  =>  <...>-633600 
  1)   0.230 us    |            cap_capable();
  1)   0.762 us    |          }
  1)   1.202 us    |        }
  1)   1.623 us    |      }
  1)   0.270 us    |      free_uid();
  1)   2.735 us    |    }
  1)   3.467 us    |  }

And here is a test to reproduce it.

#include <stdio.h>
#include <pthread.h>
#include "liburing.h"

struct thread_info_t {
	pthread_t tid;
	int num;
};

static void *doTest(void *arg) {
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;
	struct thread_info_t *ti;
	int ret;

	ti = (struct thread_info_t *)arg;
	printf("%d: start\n", ti->num);

	ret = io_uring_queue_init(128, &ring, 0);
	if (ret) {
		printf("%d: ring setup failed: %d\n", ti->num, ret);
		return arg;
	}

	sqe = io_uring_get_sqe(&ring);
	if (!sqe) {
		printf("%d: get sqe failed\n", ti->num);
		return arg;
	}

	io_uring_prep_nop(sqe);

	ret = io_uring_submit(&ring);
	if (ret <= 0) {
		printf("%d: sqe submit failed: %d\n", ti->num, ret);
		return arg;
	}

	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret < 0) {
		printf("%d: wait completion %d\n", ti->num, ret);
		return arg;
	}

	io_uring_cqe_seen(&ring, cqe);
	printf("%d: done\n", ti->num);
	return NULL;
}

int main(int argc, char *argv[])
{
	struct thread_info_t threads[10];
	int ret;
	void *res;

	for (int i=0; i<10; i++) {
		threads[i].num = i;
		ret = pthread_create(&threads[i].tid, NULL, doTest, &threads[i]);
		if (ret) {
			fprintf(stderr, "Thread create failed\n");
			return 1;
		}
	}

	for (int i=0; i<10; i++) {
		ret = pthread_join(threads[i].tid, &res);
		if (ret) {
			fprintf(stderr, "Thread join failed\n");
			return 1;
		}
		if (res) {
			fprintf(stderr, "Test failed\n");
			return 1;
		}
	}

	return 0;
}

One of my outputs is:

0: start
1: start
2: start
3: start
4: start
4: ring setup failed: -12
5: start
5: ring setup failed: -12
6: start
0: done
6: ring setup failed: -12
3: done
7: start
7: ring setup failed: -12
1: done
8: start
8: ring setup failed: -12
9: start
9: ring setup failed: -12
2: done
Test failed

Feature request: add IOSQE flag which indicates specified sqe won't wake up io_uring_enter

For an IOSQE_IO_LINK chain, people usually don't care about individual completions until the whole link chain is completed.

Currently io_uring_wait_cqe is woken for every operation's completion. io_uring_wait_cqes can partially resolve this issue, but it has its own limitations:

  1. For programs that use event loops, it's not easy to pass the arguments in.
  2. There are common situations where the number of operations to wait for cannot be determined. For example, an echo server: we have an IORING_OP_ACCEPT operation pending for new connections (which needs io_uring_wait_cqes(1)), and multiple RECV-SEND chains serving existing connections (which need io_uring_wait_cqes(2)). As a result, we have to use io_uring_wait_cqe.

Suggestion: add a new flag named IOSQE_IO_NO_AWAKE, which indicates that an operation's completion should not wake up io_uring_enter. It resolves both problems:

  1. IOSQE_IO_NO_AWAKE is set when preparing operations, so we don't need to touch the global event loop.
  2. IOSQE_IO_NO_AWAKE can be set per sqe. For example, if we set RECV(IOSQE_IO_NO_AWAKE)-SEND, then io_uring_wait_cqe works fine.

Sorry for my bad English if I can't explain myself clearly.

Question: When does iov_base have to be posix_memalign'ed?

Hi,

I looked through the files under test/ and examples/, and found that some iov_base buffers are allocated with posix_memalign, others with malloc, or are even string literals.

I then did some experiments. It seems that liburing does not care about memory alignment. Is this true? Thanks in advance.

if (posix_memalign(&iov.iov_base, 4096, 4096)) {

vecs.iov_base = "This is a pipe test\n";
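For what it's worth, the usual rule is that alignment matters for file descriptors opened with O_DIRECT (buffer, length, and offset must typically be multiples of the device block size), while ordinary buffered I/O has no such requirement. A minimal sketch of producing a suitably aligned iovec (the helper name is invented here):

```c
#include <stdlib.h>
#include <sys/uio.h>

/* Allocate a block-aligned buffer and fill in an iovec for it.
 * align must be a power of two and a multiple of sizeof(void *),
 * as posix_memalign requires; 512 or 4096 are the common choices
 * for O_DIRECT. Returns 0 on success, -1 on allocation failure. */
static int make_aligned_iovec(struct iovec *v, size_t align, size_t len)
{
	void *p = NULL;

	if (posix_memalign(&p, align, len) != 0)
		return -1;
	v->iov_base = p;
	v->iov_len = len;
	return 0;
}
```

For buffered reads/writes, a plain malloc'd buffer (or even a string literal for writes) works; the alignment only becomes a hard requirement once O_DIRECT is in play.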

Bug in io_uring_get_sqe?

io_uring_get_sqe sometimes fails to find a vacant sqe when SQPOLL is enabled, even though there is free space. Running the following test case always produces io_uring_get_sqe failed, space left: 8:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/poll.h>
#include "liburing.h"

#define NUM_ENTRIES 8
int setup_and_run();

int main(int argc, char *argv[])
{
    for (int j = 0; j < 100; j++)
    {
        int ret = setup_and_run();
        if (ret)
        {
            return ret;
        }
    }
    return 0;
}

int setup_and_run()
{
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct io_uring_params p;
    struct io_uring ring;
    int ret, data;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;
    ret = io_uring_queue_init_params(NUM_ENTRIES, &ring, &p);
    if (ret)
    {
        fprintf(stderr, "ring create failed: %d\n", ret);
        return 1;
    }

    if (p.sq_entries != NUM_ENTRIES)
    {
        fprintf(stderr, "ring create failed, wanted %d sq entries, got: %d entries\n", NUM_ENTRIES, p.sq_entries);
        return 1;
    }

    for (int i = 0; i < NUM_ENTRIES; i++)
    {
        sqe = io_uring_get_sqe(&ring);
        if (!sqe)
        {
            fprintf(stderr, "io_uring_get_sqe failed\n");
            return 1;
        }

        io_uring_prep_nop(sqe);
        io_uring_sqe_set_data(sqe, (void *)(unsigned long)42);
    }

    ret = io_uring_submit(&ring);

    if (!ret)
    {
        fprintf(stderr, "io_uring_submit failed\n");
        return -1;
    }

    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret == 0)
    {
        data = (unsigned long)io_uring_cqe_get_data(cqe);
        if (data != 42)
        {
            fprintf(stderr, "invalid data: %d\n", data);
            return data;
        }

        int space_left = io_uring_sq_space_left(&ring);
        sqe = io_uring_get_sqe(&ring);
        if (sqe == NULL)
        {
            fprintf(stderr, "io_uring_get_sqe failed, space left: %d\n", space_left);
            return 1;
        }
    }
    else
    {
        fprintf(stderr, "io_uring_wait_cqe failed : %d\n", ret);
        return ret;
    }

    io_uring_queue_exit(&ring);
    return 0;
}

Unexpected CQE result -512 (recvmsg+cancel)

I have a socket open with an asynchronous recvmsg (io_uring_prep_recvmsg + io_uring_sqe_set_data) outstanding. No data is being supplied by the other end. Subsequently, the recvmsg is canceled (io_uring_prep_cancel). The CQE for the cancel gives -114 (-EALREADY), which is expected; however, the CQE for the recvmsg receives -512, which is not. (512 looks like the kernel-internal ERESTARTSYS, which normally should not be visible to userspace.)

Not sure where the result is being generated. In case it's a kernel issue: I'm testing with https://kernel.ubuntu.com/~kernel-ppa/mainline/daily/2019-12-01/.

Possible bug in __io_uring_submit

This program never terminates:

#include "liburing.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct io_uring ring;
    int ring_flags, ret, data;

    ring_flags = IORING_SETUP_SQPOLL;
    ret = io_uring_queue_init(64, &ring, ring_flags);
    if (ret) {
        fprintf(stderr, "ring create failed: %d\n", ret);
        return 1;
    }

    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "sqe get failed\n");
        return 1;
    }

    io_uring_prep_nop(sqe);
    io_uring_sqe_set_data(sqe, (void *)(unsigned long)42);
    io_uring_submit_and_wait(&ring, 1);

    ret = io_uring_peek_cqe(&ring, &cqe);
    if (ret) {
        fprintf(stderr, "cqe get failed\n");
        return 1;
    }
    data = (int)(unsigned long)io_uring_cqe_get_data(cqe);
    if (data != 42) {
        fprintf(stderr, "invalid data: %d\n", data);
        return 1;
    }
    return 0;
}


changing this line

if (wait_nr || sq_ring_needs_enter(ring, &flags)) {
to

if (sq_ring_needs_enter(ring, &flags) || wait_nr) {

fixes it. If I'm wrong, I'm sorry.

Suggestion: IORING_OP_TIMEOUT with sqe->off == 0

I noticed that for IORING_OP_TIMEOUT, if the completion event count is not set, it defaults to 1. That's not very useful in my opinion. I suggest that if sqe->off equals 0, IORING_OP_TIMEOUT acts like a plain timer; that is to say, it won't be completed through other requests' completions.

With this change, timerfd could be partially replaced. Intervals don't suit io_uring, though.

Yes, it's a breaking change. sqe->off == -1 is also worth considering.

IORING_SETUP_SQPOLL randomly fails IOs

My toy program spins up two threads: it submits IOs from one thread and reaps them from the other. I set up liburing with the IORING_SETUP_SQPOLL flag.

Following is my code:

#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include "liburing.h"

#define DEVICE_SIZE (512ULL << 30)
#define BS 4096
#define QD 32

static struct io_uring ring;
static int dev_fd;

static void *setup_iov_base(size_t size)
{
        void *buf;
        int fd;

        if (posix_memalign(&buf, BS, size) != 0) {
                printf("mem aligned failed\n");
                return NULL;
        }

        fd = open("/dev/urandom", O_RDONLY);
        if (fd < 0) {
                printf("Failed to open urandom. rc=%d\n", fd);
                return NULL;
        }

        read(fd, buf, size);
        close(fd);

        return buf;
}

static int init(void)
{
        struct io_uring_params p = { 0 };
        time_t t;
        int rc;

        /* Implies no syscalls to submit IOs */
        p.flags = IORING_SETUP_SQPOLL;
        rc = io_uring_queue_init_params(QD, &ring, &p);
        if (rc != 0) {
                printf("queue_init failed. rc=%d\n", rc);
                return rc;
        }

        dev_fd = open("/dev/nvme1n1", O_RDWR | O_DIRECT);
        if (dev_fd < 0) {
                printf("Failed to open nvme device. rc=%d\n", dev_fd);
                return dev_fd;
        }

        /* SQPOLL only works with fixed files. */
        rc = io_uring_register_files(&ring, &dev_fd, 1);
        if (rc < 0) {
                printf("Failed to register files. rc=%d\n", rc);
                return rc;
        }

        srand((unsigned) time(&t));

        return 0;
}

static inline void submit_to_kernel(char *failure_message)
{
        int rc;

        rc = io_uring_submit(&ring);
        if (rc < 0) {
                printf("%s. rc=%d\n", failure_message, rc);
        }
}

static struct io_uring_sqe *get_sqe(int *yield)
{
        struct io_uring_sqe *sqe;

        while ((sqe = io_uring_get_sqe(&ring)) == NULL) {
                /* Kick kernel thread if it is taking a nap */
                submit_to_kernel("Failure to wake napping thread");

                *yield = *yield + 1;
                /* TODO: Use condition variables */
                pthread_yield();
        }

        return sqe;
}

static void *submit_io(void *input)
{
        char *buf = setup_iov_base(BS);
        off_t offset = 0;
        int total_ios = *((int *)input);
        int yield = 0;
        while (total_ios != 0) {
                struct io_uring_sqe *sqe = get_sqe(&yield);
                struct iovec iov = {
                        .iov_base = buf,
                        .iov_len = BS,
                };

                io_uring_prep_writev(sqe, 0, &iov, 1, offset);

                sqe->flags |= IOSQE_FIXED_FILE;
                sqe->user_data = offset;

                total_ios--;
                if (total_ios % QD == 0) {
                        submit_to_kernel("Failed to submit new IO");
                }

                offset += BS;
        }

        printf("submit_io yield %d times\n", yield);

        return NULL;
}

static void *reap_io_completions(void *input)
{
        int total_ios = *((int *)input);
        int failed_ios = 0;

        while (total_ios != 0) {
                struct io_uring_cqe *cqe = NULL;
                /* This call blocks if no CQE entries are available */
                int rc = io_uring_wait_cqe(&ring, &cqe);
                if (rc < 0 || cqe->res != BS) {
                        printf("rc=%d cqe->res=%d offset=%llu\n", rc, cqe->res, cqe->user_data);
                        failed_ios++;
                }

                total_ios--;
                io_uring_cqe_seen(&ring, cqe);
        }

        printf("Failed IO count=%d\n", failed_ios);

        return NULL;
}

int main(int argc, char *argv[])
{
        pthread_t submit, complete;
        int total_ios;
        int rc;

        if (argc != 2) {
                printf("Expected two arguments\n");
                return -EINVAL;
        }

        total_ios = atoi(argv[1]);

        rc = init();
        if (rc != 0) {
                return rc;
        }

        rc = pthread_create(&submit, NULL, submit_io, &total_ios);
        if (rc < 0) {
                printf("Failed to create submit thread. rc=%d\n", rc);
                return rc;
        }

        rc = pthread_create(&complete, NULL, reap_io_completions, &total_ios);

        if (rc < 0) {
                printf("Failed to create complete thread. rc=%d\n", rc);
                return rc;
        }

        pthread_join(submit, NULL);
        pthread_join(complete, NULL);

        io_uring_queue_exit(&ring);
        close(dev_fd);

        return 0;
}

Here is my output for multiple runs:

[root@ip-10-0-58-7 liburing]# ./examples/iouringthread 65
submit_io yield 74 times
rc=0 cqe->res=-14 offset=258048
rc=0 cqe->res=-14 offset=262144
Failed IO count=2

[root@ip-10-0-58-7 liburing]# ./examples/iouringthread 65
submit_io yield 194 times
rc=0 cqe->res=-14 offset=204800
rc=0 cqe->res=-14 offset=208896
rc=0 cqe->res=-14 offset=212992
rc=0 cqe->res=-14 offset=217088
rc=0 cqe->res=-14 offset=221184
rc=0 cqe->res=-14 offset=225280
rc=0 cqe->res=-14 offset=229376
rc=0 cqe->res=-14 offset=233472
rc=0 cqe->res=-14 offset=237568
rc=0 cqe->res=-14 offset=241664
rc=0 cqe->res=-14 offset=245760
rc=0 cqe->res=-14 offset=249856
rc=0 cqe->res=-14 offset=253952
rc=0 cqe->res=-14 offset=258048
rc=0 cqe->res=-14 offset=262144
Failed IO count=15

[root@ip-10-0-58-7 liburing]# ./examples/iouringthread 65
submit_io yield 69 times
rc=0 cqe->res=-14 offset=196608
rc=0 cqe->res=-14 offset=200704
rc=0 cqe->res=-14 offset=204800
rc=0 cqe->res=-14 offset=208896
rc=0 cqe->res=-14 offset=212992
rc=0 cqe->res=-14 offset=217088
rc=0 cqe->res=-14 offset=221184
rc=0 cqe->res=-14 offset=225280
rc=0 cqe->res=-14 offset=229376
rc=0 cqe->res=-14 offset=233472
rc=0 cqe->res=-14 offset=237568
rc=0 cqe->res=-14 offset=241664
rc=0 cqe->res=-14 offset=245760
rc=0 cqe->res=-14 offset=249856
rc=0 cqe->res=-14 offset=253952
rc=0 cqe->res=-14 offset=258048
rc=0 cqe->res=-14 offset=262144
Failed IO count=17

If I update the code to not use SQPOLL it works just fine.

liburing commit ID - a68caac

question: io_uring_enter EAGAIN return

Hi, I have a question about io_uring_enter and EAGAIN.

When to_submit is zero, can io_uring_enter return EAGAIN?

When to_submit is not zero, can io_uring_enter return 0? And if so, when does it return 0, and when EAGAIN?

privilege requirement for using SQPOLL

Calling io_uring_setup with IORING_SETUP_SQPOLL returns -1 with errno = 1 (EPERM).
After several failed searches of the documentation, I eventually found in the kernel code that it requires CAP_SYS_ADMIN.

Could you add a few words to liburing's comments, or maybe to a newer version of "Efficient IO with io_uring", to warn new SQPOLL enthusiasts about this possible error?

BTW, this privilege check really makes SQPOLL hard to use, since the user process has to run with escalated privileges, or forgo SQPOLL entirely...
