
async-rdma's Introduction

DatenLord



DatenLord is a next-generation cloud-native distributed storage platform that aims to meet the performance-critical storage needs of next-generation cloud-native applications, such as microservices, serverless, and AI. On one hand, DatenLord is designed to be a cloud-native storage system, which is itself distributed, fault-tolerant, and supports graceful upgrades. These cloud-native features make DatenLord easy to use and easy to maintain. On the other hand, DatenLord is designed as an application-oriented storage system, in that it is optimized for many performance-critical scenarios, such as databases, AI machine learning, and big data. Meanwhile, DatenLord provides a high-performance storage service for containers, which facilitates stateful applications running on top of Kubernetes (K8S). The high performance of DatenLord is achieved by leveraging the most recent technology revolutions in hardware and software, such as NVMe, non-volatile memory, asynchronous programming, and native Linux asynchronous IO support.


Why DatenLord?

Why do we build DatenLord? The reason is two-fold:

  • Firstly, the recent computer hardware architecture revolution calls for refactoring storage software. The storage-related functionality inside the Linux kernel hasn't changed much in the past 10 years, when the hard-disk drive (HDD) was the main storage device. Nowadays, the solid-state drive (SSD) has become mainstream, not to mention the most advanced SSDs, NVMe and non-volatile memory. The performance of SSD is hundreds of times better than that of HDD: HDD latency is around 1–10 ms, whereas SSD latency is around 50–150 μs, NVMe latency is around 25 μs, and non-volatile memory latency is 350 ns. With this performance revolution in storage devices, traditional blocking-style/synchronous IO in the Linux kernel becomes very inefficient, and non-blocking-style/asynchronous IO is much more applicable. The Linux kernel community has already realized this and recently introduced a native asynchronous IO mechanism, io_uring, to improve IO performance. Besides blocking-style/synchronous IO, the context-switch overhead in the Linux kernel is no longer negligible w.r.t. SSD latency. Many modern programming languages have adopted asynchronous programming, green threads, or coroutines to manage asynchronous IO tasks in user space, in order to avoid the context-switch overhead introduced by blocking IO. Therefore we think it's time to build a next-generation storage system that takes advantage of the storage performance revolution as much as possible, by leveraging non-blocking/asynchronous IO, asynchronous programming, NVMe, and even non-volatile memory.

  • Secondly, most distributed/cloud-native systems separate computing and storage: computing tasks/applications and storage systems run on dedicated clusters. This isolated architecture minimizes maintenance effort, since it decouples the maintenance tasks of computing clusters and storage clusters, such as upgrade, expansion, and migration of each cluster, which is much simpler than maintaining coupled clusters. Nowadays, however, applications deal with much larger datasets than ever before. One notorious example is an AI training job that takes one hour to load data whereas the training itself finishes in only 45 minutes. Isolating computing and storage therefore makes IO very inefficient, as transferring data between applications and storage systems over the network takes a lot of time. Further, with the isolated architecture, applications have to be aware of differing data locations and the varying access cost due to data location, network distance, etc. DatenLord tackles the IO performance issue of the isolated architecture in a novel way: it abstracts away the heterogeneous storage details and makes the differences in data location, access cost, etc., transparent to applications. With DatenLord, applications can assume all the data to be accessed is local, and DatenLord will access the data on behalf of the applications. Besides, DatenLord can help K8S schedule jobs close to cached data, since DatenLord knows the exact location of all cached data. By doing so, applications are greatly simplified w.r.t. data access, and DatenLord can leverage the local cache, neighbor caches, and remote caches to speed up data access and boost performance.


Target scenarios

The main scenario for DatenLord is to facilitate high availability across multi-cloud, hybrid-cloud, multiple data centers, etc. Concretely, many online business providers run businesses too important to afford any downtime. To achieve high availability, these service providers leverage multi-cloud, hybrid-cloud, and multiple data centers to avoid single points of failure in any single cloud or data center, by deploying applications and services across multiple clouds or data centers. It is relatively easy to deploy applications and services to multiple clouds and data centers, but it is much harder to duplicate all data to all clouds or all data centers in a timely manner, due to the huge data size. If data is not equally available across multiple clouds or data centers, the online business may still suffer from a single point of failure, because data becomes unavailable when the cloud or data center holding it fails.

DatenLord alleviates the data unavailability caused by cloud or data center failures by caching data in multiple layers, such as a local cache, neighbor caches, and remote caches. Although the total data size is huge, the hot data involved in online business is usually of limited size, a property known as data locality. DatenLord leverages data locality and builds a set of large-scale, distributed, automatic cache layers to buffer hot data in a smart manner. The benefit of DatenLord is two-fold:

  • DatenLord is transparent to applications, namely DatenLord does not need any modification to applications;
  • DatenLord is high performance: it automatically caches data according to its hotness, and its performance comes from applying different caching strategies to different target applications, e.g., a least recently used (LRU) strategy for some kinds of random access and a most recently used (MRU) strategy for some kinds of sequential access.

Architecture

Single Data Center

DatenLord Single Data Center

Multiple Data Centers and Hybrid Cloud

DatenLord Multiple Data Centers and Hybrid Cloud

DatenLord provides 3 kinds of user interfaces: a KV interface, an S3 interface, and a file interface. The backend storage is supported by the underlying distributed cache layer, which is strongly consistent. Strong consistency is guaranteed by the metadata management module, which is built on a high-performance consensus protocol. The persistent storage layer can be local disk or S3 storage. For the network, RDMA is used to provide high-throughput, low-latency networking; if RDMA is not supported, TCP is an alternative. For the multiple data centers and hybrid cloud scenario, there is a dedicated metadata server that serves metadata requests within the same data center, while in the single data center scenario, the metadata module can run on the same machine as the cache node. The network between data centers and public clouds is managed as a private network to guarantee high-quality data transfer.

Quick Start

Currently DatenLord is built as Docker images and can be deployed via K8S.

To deploy DatenLord via K8S, simply run:

  • sed -e 's/e2e_test/latest/g' scripts/setup/datenlord.yaml > datenlord-deploy.yaml
  • kubectl apply -f datenlord-deploy.yaml

To use DatenLord, just define a PVC using the DatenLord StorageClass, and then deploy a Pod using this PVC:

cat <<EOF >datenlord-demo.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-datenlord-test
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 100Mi
  storageClassName: csi-datenlord-sc

---
apiVersion: v1
kind: Pod
metadata:
  name: mysql-datenlord-test
spec:
  containers:
  - name: mysql
    image: mysql
    env:
    - name: MYSQL_ROOT_PASSWORD
      value: "rootpasswd"
    volumeMounts:
    - mountPath: /var/lib/mysql
      name: data
      subPath: mysql
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pvc-datenlord-test
EOF

kubectl apply -f datenlord-demo.yaml

DatenLord provides a customized scheduler that implements the K8S scheduler extender. The scheduler tries to schedule a pod to the node that holds the volume it requests. To use the scheduler, add schedulerName: datenlord-scheduler to the spec of your pod. Caveat: a dangling Docker image may cause a "failed to parse request" error; running docker image prune on each K8S node fixes it.

If you use the K8S CSI snapshot feature, you may need to install the snapshot CRD and controller on K8S:

  • kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
  • kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
  • kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
  • kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
  • kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml

Monitoring

The DatenLord monitoring guideline is in datenlord monitoring. We provide both YAML and Helm methods to deploy the monitoring system.

To use the YAML method, just run

sh ./scripts/setup/datenlord-monitor-deploy.sh

To use the Helm method, run

sh ./scripts/setup/datenlord-monitor-deploy.sh helm

Performance Test

The performance test is done with fio, and fio-plot is used to plot performance histograms.

To run the performance test:

sudo apt-get update
sudo apt-get install -y fio python3-pip
sudo pip3 install matplotlib numpy fio-plot
sh ./scripts/perf/fio-perf-test.sh TEST_DIR

Four histograms will be generated:

  • Random read IOPS and latency for different block sizes
  • Random write IOPS and latency for different block sizes
  • Random read IOPS and latency for different read thread numbers with 4k block size
  • Random write IOPS and latency for different write thread numbers with 4k block size

The performance test is added to GitHub Actions (cron.yml), and a performance report is generated and archived as artifacts (Example) every four hours.

How to Contribute

Anyone interested in DatenLord is welcome to contribute.

Coding Style

Please follow the code style. Meanwhile, DatenLord adopts very strict clippy linting; please fix every clippy warning before submitting your PR. Also please make sure all CI tests pass.

Continuous Integration (CI)

The CI of DatenLord leverages GitHub Actions. There are two CI flows for DatenLord: one for Rust cargo test, clippy lints, and standard filesystem E2E checks; the other for CSI-related tests, such as the CSI sanity test and the CSI E2E test.

The CSI E2E test setup is a bit complex and its action script cron.yml is quite long, so let's explain it in detail:

  • First, it sets up a test K8S cluster with one master node and three slave nodes, using Kubernetes in Docker (KinD);

  • Second, the CSI E2E test requires password-less SSH login to each K8S slave node, since it might run commands to prepare the test environment or verify test results, so it has to set up an SSH key in each Docker container of the KinD slave nodes;

  • Third, it builds DatenLord container images and loads them into KinD. This is a caveat of KinD: since KinD runs K8S nodes inside Docker containers, kubelet cannot reach any resource on the local host, so KinD provides a load operation to make container images from the local host visible to kubelet;

  • At last, it deploys DatenLord to the test K8S cluster, downloads the pre-built K8S E2E binary, runs it in parallel by invoking ginkgo -p, and selects only the External.Storage related CSI E2E test cases to run.

Sub-Projects

DatenLord has several related sub-projects, mostly work in progress, listed alphabetically:

Road Map

  • 0.1 Refactor async fuse lib to provide clear async APIs, which is used by the datenlord filesystem.
  • 0.2 Support all Fuse APIs in the datenlord fs.
  • 0.3 Make fuse lib fully asynchronous. Switch async fuse lib's device communication channel from blocking I/O to io_uring.
  • 0.4 Complete K8S integration test.
  • 0.5 Support RDMA.
  • 1.0 Complete Tensorflow K8S integration and finish performance comparison with raw fs.


async-rdma's Issues

contribution

How can I contribute to the project? Is there any Discord or Slack channel for discussion?

Potential unsoundness

There is no guarantee that the safety requirements are met here.

https://doc.rust-lang.org/std/slice/fn.from_raw_parts.html#safety

/// Get the memory region as slice
#[inline]
fn as_slice(&self) -> &[u8] {
    unsafe { slice::from_raw_parts(self.as_ptr(), self.length()) }
}

/// Get the memory region as mut slice
#[inline]
fn as_mut_slice(&mut self) -> &mut [u8] {
    unsafe { slice::from_raw_parts_mut(self.as_mut_ptr(), self.length()) }
}
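
For illustration, here is a minimal, hypothetical sketch (not the crate's actual code) of what it would look like to spell out the from_raw_parts requirements; the Mr struct and its fields are stand-ins for the real memory region internals:

use std::slice;

/// Hypothetical stand-in for a registered memory region.
struct Mr {
    ptr: *const u8,
    len: usize,
}

impl Mr {
    /// Sketch of `as_slice` with the `from_raw_parts` requirements made explicit.
    fn as_slice(&self) -> &[u8] {
        debug_assert!(!self.ptr.is_null());
        debug_assert!(self.len <= isize::MAX as usize);
        // SAFETY: `ptr` must point to `len` initialized bytes that are not
        // mutated (e.g. by the NIC or a remote peer) while the returned
        // borrow is alive; a concurrent remote write would make this unsound.
        unsafe { slice::from_raw_parts(self.ptr, self.len) }
    }
}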

Large array transfer problem

Env

Two Ubuntu20.04 LTS clients for Vmware

Description

The two clients can each perform RDMA large-array operations separately by binding 127.0.0.1. However, when one client performs an RDMA operation on a large array against the other, no packets are sent; I used Wireshark but did not capture any packets, whereas operating on a small array does produce packets.

By the way, for arrays that exceed the size of the ulimit -l parameter, I do not need to raise the locked-memory limit for RDMA on a single client. But when two clients perform RDMA operations, the large array requires raising the ulimit -l limit in order to connect.

The two clients' Rust code is as follows:

// server
use async_rdma::{LocalMrReadAccess, LocalMrWriteAccess, Rdma, RdmaListener};
use portpicker::pick_unused_port;
use std::{
    alloc::Layout,
    io,
    net::{Ipv4Addr, SocketAddrV4},
    time::Duration,
};

const LEN:usize = 30 * 1024 * 1024;

async fn server(addr: SocketAddrV4) -> io::Result<()> {
    let rdma_listener = RdmaListener::bind(addr).await?;
    let rdma = rdma_listener.accept(1, 1, 512).await?;
    // receive the immediate value written by the client
    let imm_v = rdma.receive_write_imm().await?;
    // print the immediate value
    print!("immediate value: {}", imm_v);
    // receive the metadata of the mr sent by client
    let lmr = rdma.receive_local_mr().await?;
    // print the content of lmr, which was `write` by client
    unsafe { println!("{:?}", *(lmr.as_ptr() as *const [i32;LEN])) };
    // wait for the agent thread to send all responses to the remote.
    tokio::time::sleep(Duration::from_secs(1)).await;
    Ok(())
}
#[tokio::main]
async fn main() {
    let addr = SocketAddrV4::new(Ipv4Addr::new(192, 168, 59, 131), 8081);
    server(addr).await.unwrap();
    tokio::time::sleep(Duration::from_secs(3)).await;
}


// client
use async_rdma::{LocalMrReadAccess, LocalMrWriteAccess, Rdma, RdmaListener};
use portpicker::pick_unused_port;
use std::{
    alloc::Layout,
    io,
    net::{Ipv4Addr, SocketAddrV4},
    time::Duration,
};


const LEN:usize = 30 * 1024 * 1024;

async fn client(addr: SocketAddrV4) -> io::Result<()> {
    let rdma = Rdma::connect(addr, 1, 1, 512).await?;
    let mut lmr = rdma.alloc_local_mr(Layout::new::<[i32;LEN]>())?;
    let mut rmr = rdma.request_remote_mr(Layout::new::<[i32;LEN]>()).await?;
    // load data into lmr
    unsafe { *(lmr.as_mut_ptr() as *mut [i32;LEN]) = [0;LEN] };
    // write the content of local mr into remote mr  with immediate value
    rdma.write_with_imm(&lmr, &mut rmr, 1).await?;
    // then send rmr's metadata to server to make server aware of it
    rdma.send_remote_mr(rmr).await?;
    Ok(())
}
#[tokio::main]
async fn main() {
    let addr = SocketAddrV4::new(Ipv4Addr::new(192, 168, 59, 131), 8081);
    client(addr)
        .await
        .map_err(|err| println!("{}", err))
        .unwrap();
    tokio::time::sleep(Duration::from_secs(3)).await;
}

Optimize the poll workflow of CQ

Unnecessary allocation: the single entry can be written into a stack slot (MaybeUninit<WorkCompletion>) instead of a Vec.

/// Poll one work completion from CQ
pub(crate) fn poll_single(&self) -> io::Result<WorkCompletion> {
    let polled = self.poll(1)?;
    polled
        .into_iter()
        .next()
        .ok_or_else(|| io::Error::new(io::ErrorKind::WouldBlock, ""))
}

Yes, poll is only used by poll_single for now, so the allocation of the vector is unnecessary.
poll_single is an interim scheme; we will optimize the poll workflow later and track it in a new issue.

Originally posted by @GTwhy in #49 (comment)

Currently, the event_listener only polls one CQE at a time; the poll workflow can be optimized to poll more at a time.
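
A minimal sketch of the suggested direction, assuming a hypothetical lower-level poll_into primitive that fills caller-provided storage (the Cq and WorkCompletion types here are stubs for illustration):

use std::io;
use std::mem::MaybeUninit;

struct WorkCompletion;
struct Cq;

impl Cq {
    /// Hypothetical primitive: fill up to `buf.len()` entries, return how many were written.
    fn poll_into(&self, _buf: &mut [MaybeUninit<WorkCompletion>]) -> io::Result<usize> {
        Ok(0) // stub
    }

    /// Poll one work completion without allocating a `Vec`.
    fn poll_single(&self) -> io::Result<WorkCompletion> {
        let mut slot = MaybeUninit::uninit();
        if self.poll_into(std::slice::from_mut(&mut slot))? == 1 {
            // SAFETY: `poll_into` reported that it initialized this entry.
            Ok(unsafe { slot.assume_init() })
        } else {
            Err(io::Error::new(io::ErrorKind::WouldBlock, ""))
        }
    }
}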

Unnecessary overflow check

left.overflow_shl(32) never panics because 0 <= time < 1e6. It can be written as left.wrapping_shl(32).

async-rdma/src/id.rs

Lines 5 to 20 in 2a55f32

/// Create a random u64 id.
///
/// Both `WorkRequestId` and `AgentRequestId` depend on this, so make this fn independent.
/// To avoid id duplication, this fn concatenates `SystemTime` and a random number into a u64.
/// The syscall may have some overhead, which can be improved later by balancing the pros and cons.
pub(crate) fn random_u64() -> u64 {
    let start = SystemTime::now();
    // No time can be earlier than Unix Epoch
    #[allow(clippy::unwrap_used)]
    let since_the_epoch = start.duration_since(UNIX_EPOCH).unwrap();
    let time = since_the_epoch.subsec_micros();
    let rand = rand::thread_rng().gen::<u32>();
    let left: u64 = time.into();
    let right: u64 = rand.into();
    left.overflow_shl(32) | right
}
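
A sketch of the suggested rewrite: subsec_micros() is always below 1_000_000, so after widening to u64 the shift by 32 cannot overflow, and wrapping_shl makes the absence of a runtime check explicit:

use rand::Rng;
use std::time::{SystemTime, UNIX_EPOCH};

pub fn random_u64() -> u64 {
    // No time can be earlier than Unix Epoch
    let since_the_epoch = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system time before Unix Epoch");
    // subsec_micros() < 1_000_000 < 2^32, so `time << 32` always fits in a u64.
    let time: u64 = since_the_epoch.subsec_micros().into();
    let rand: u64 = rand::thread_rng().gen::<u32>().into();
    time.wrapping_shl(32) | rand
}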

self.addr().overflow_add(i.start) and i.end.overflow_sub(i.start) never panic because the bounds have been checked.

(Should "wrong range of lmr" be an IO error?)

/// Get a local mr slice
#[inline]
pub fn get(&self, i: Range<usize>) -> io::Result<LocalMrSlice> {
    // SAFETY: `self` is checked to be valid and in bounds above.
    if i.start >= i.end || i.end > self.length() {
        Err(io::Error::new(io::ErrorKind::Other, "wrong range of lmr"))
    } else {
        Ok(LocalMrSlice::new(
            self,
            self.addr().overflow_add(i.start),
            i.len(),
        ))
    }
}

/// Get a mutable local mr slice
#[inline]
pub fn get_mut(&mut self, i: Range<usize>) -> io::Result<LocalMrSliceMut> {
    // SAFETY: `self` is checked to be valid and in bounds above.
    if i.start >= i.end || i.end > self.length() {
        Err(io::Error::new(io::ErrorKind::Other, "wrong range of lmr"))
    } else {
        Ok(LocalMrSliceMut::new(
            self,
            self.addr().overflow_add(i.start),
            i.len(),
        ))
    }
}

/// Get a remote mr slice
#[inline]
pub fn get(&self, i: Range<usize>) -> io::Result<RemoteMrSlice> {
    // SAFETY: `self` is checked to be valid and in bounds above.
    if i.start >= i.end || i.end > self.length() {
        Err(io::Error::new(io::ErrorKind::Other, "wrong range of rmr"))
    } else {
        let slice_token = MrToken {
            addr: self.addr().overflow_add(i.start),
            len: i.end.overflow_sub(i.start),
            rkey: self.rkey(),
        };
        Ok(RemoteMrSlice::new_from_token(self, slice_token))
    }
}

/// Get a mutable remote mr slice
#[inline]
pub fn get_mut(&mut self, i: Range<usize>) -> io::Result<RemoteMrSliceMut> {
    // SAFETY: `self` is checked to be valid and in bounds above.
    if i.start >= i.end || i.end > self.length() {
        Err(io::Error::new(io::ErrorKind::Other, "wrong range of rmr"))
    } else {
        let slice_token = MrToken {
            addr: self.addr().overflow_add(i.start),
            len: i.end.overflow_sub(i.start),
            rkey: self.rkey(),
        };
        Ok(RemoteMrSliceMut::new_from_token(self, slice_token))
    }
}

start.overflow_add(self.max_sr_data_len) can be written as start.saturating_add(self.max_sr_data_len).

end.overflow_sub(start) never panics because start < end.

async-rdma/src/agent.rs

Lines 175 to 178 in 2441414

let end = (start.overflow_add(self.max_sr_data_len)).min(lm_len);
let kind = RequestKind::SendData(SendDataRequest {
    len: end.overflow_sub(start),
});
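
A small sketch combining both suggestions, with free-standing parameters standing in for the agent's fields:

/// Compute the length of the next chunk to send, assuming the caller
/// guarantees `start <= lm_len`.
fn chunk_len(start: usize, max_sr_data_len: usize, lm_len: usize) -> usize {
    // `saturating_add` clamps instead of panicking on overflow, and the
    // subtraction cannot underflow because `end >= start` holds here.
    let end = start.saturating_add(max_sr_data_len).min(lm_len);
    end - start
}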

Cancellation safety

Disclaimer: I knew nothing about the details of RDMA before this.

I have some questions about the example at first glance.

/// The server can't be aware of RDMA `read` or `write`, so we need to sync with the client
/// before the client `read`s and after it `write`s.
async fn sync_with_client(rdma: &Rdma) {
    let mut lmr_sync = rdma
        .alloc_local_mr(Layout::new::<Request>())
        .map_err(|err| println!("{}", &err))
        .unwrap();
    // write data to lmr
    unsafe { *(lmr_sync.as_mut_ptr() as *mut Request) = Request::Sync };
    rdma.send(&lmr_sync)
        .await
        .map_err(|err| println!("{}", &err))
        .unwrap();
}

  1. When will the local memory region be deallocated?
  2. Does the unsafe statement have any safety requirements?
  3. What if the future returned by sync_with_client is dropped (cancelled) before the request completes?
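
To make question 3 concrete, here is a hedged illustration (reusing the Rdma handle and sync_with_client from the example above) of how a caller could drop the send future mid-flight, e.g. under a timeout:

use std::time::Duration;

// If the timeout fires first, the future returned by `sync_with_client` is
// dropped before `rdma.send` completes. What happens to the posted work
// request and the lmr it references at that point?
async fn sync_with_timeout(rdma: &Rdma) {
    if tokio::time::timeout(Duration::from_millis(10), sync_with_client(rdma))
        .await
        .is_err()
    {
        // The send future was cancelled here, mid-operation.
    }
}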

Related article: https://zhuanlan.zhihu.com/p/346219893

Connection timeout between servers

I'm new to RDMA in general, so I'm just trying to explore how this works. I tried cargo run --example rpc on my localhost and it runs fine. However, when I try to run this between 2 servers, I get a timeout error once the client connects. I'm using cargo run --example client and cargo run --example server. Any troubleshooting tips?

Encounter a doubt when using MR in async-rdma

While studying this library, I encountered the following result, which confused me.
request structure: (screenshot)
client side: (screenshot)
server side: (screenshot)
main: (screenshot)
result:

rpc server started
header size: 28
size: 80, 
Msg { header: RequestHeader { id: 0, type: 0, flags: 0, total_length: 0, file_path_length: 0, meta_data_length: 0, data_length: 0 }, meta_data: [], data: [] }
server side size: 80, 
Msg { header: RequestHeader { id: 0, type: 0, flags: 0, total_length: 6, file_path_length: 0, meta_data_length: 3, data_length: 3 }, meta_data: [1, 2, 3, 4, 5], data: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] }
response: Msg { header: ResponseHeader { id: 0, status: 0, flags: 0, total_length: 0, meta_data_length: 0, data_length: 0 }, meta_data: [1, 2, 3, 4, 5], data: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4] }

Check more hardware limitations

    Shall we also check other hardware limitations?

Originally posted by @rogercloud in #89 (comment)

We may get errors returned by ibv APIs with ambiguous error messages because of hardware limitations.

We can check more hardware limitations, capture more errors and add some debug information.

confusion about support for Infiniband and mlx5?

(First I have to declare that I am really a novice to RDMA ...)

This library works fine when I play with RDMA based on SoftiWARP. However, now that I am trying to move all my code and tools to a platform equipped with an MLX5 card, I can no longer connect my rdma server and client (by specifying ip:port).

Sorry if this question is stupid, but I am not quite clear about the relationships and differences between these concepts: RoCE, RXE, iWARP, MLX, Infiniband (IB) ...

Does this repo only support RoCE but not Infiniband(IB)?

run example server panic

I ran the server example with the commands below. The value imm is None, so the unwrap panics.

/data/async-rdma$ RUST_BACKTRACE=1 cargo run --example server localhost 9527
    Finished dev [unoptimized + debuginfo] target(s) in 0.08s
     Running `target/debug/examples/server localhost 9527`
server start
accepted
[1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1], None
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', examples/server.rs:32:27
stack backtrace:
   0: rust_begin_unwind
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/panicking.rs:114:5
   3: core::option::Option<T>::unwrap
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/option.rs:823:21
   4: server::receive_data_with_imm_from_client::{{closure}}
             at ./examples/server.rs:32:23
   5: server::main::{{closure}}
             at ./examples/server.rs:103:45
   6: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/park.rs:283:63
   7: tokio::runtime::coop::with_budget
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/coop.rs:107:5
   8: tokio::runtime::coop::budget
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/coop.rs:73:5
   9: tokio::runtime::park::CachedParkThread::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/park.rs:283:31
  10: tokio::runtime::context::BlockingRegionGuard::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/context.rs:315:13
  11: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/scheduler/multi_thread/mod.rs:66:9
  12: tokio::runtime::runtime::Runtime::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/runtime.rs:304:45
  13: server::main


/data/async-rdma$ RUST_BACKTRACE=1 cargo run --example client localhost 9527
    Finished dev [unoptimized + debuginfo] target(s) in 0.08s
     Running `target/debug/examples/client localhost 9527`
client start
connected
[0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 2, 2, 2, 2]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: TimedOut, error: "Timeout for waiting for a response." }', examples/client.rs:134:37
stack backtrace:
   0: rust_begin_unwind
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/panicking.rs:64:14
   2: core::result::unwrap_failed
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/result.rs:1790:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/result.rs:1112:23
   4: client::main::{{closure}}
             at ./examples/client.rs:134:5
   5: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/park.rs:283:63
   6: tokio::runtime::coop::with_budget
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/coop.rs:107:5
   7: tokio::runtime::coop::budget
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/coop.rs:73:5
   8: tokio::runtime::park::CachedParkThread::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/park.rs:283:31
   9: tokio::runtime::context::BlockingRegionGuard::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/context.rs:315:13
  10: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/scheduler/multi_thread/mod.rs:66:9
  11: tokio::runtime::runtime::Runtime::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/runtime.rs:304:45
  12: client::main
             at ./examples/client.rs:141:5
  13: core::ops::function::FnOnce::call_once
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

RDMA Soundness Scope

It is impossible to prevent an incorrect remote process from triggering UB in the local process.
Like mmap and /proc/self/mem, such a situation is beyond the control of the Rust language.

There are two solutions:

  • Document the behavior and remove it from soundness concerns.
    Like rust-lang/rust#97837
  • Put an unsafe function on the way from network connections to active RDMA connections.
    The function means "trust the remote process", although it is impossible to actually check whether the remote process is correct (see the sketch below).

A timeout from a single side is still unsound, because UB may happen when the system time goes backwards.
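
A sketch of the second option, with a hypothetical name and signature; the point is only that the trust decision becomes an explicit unsafe call:

use std::io;
use std::net::SocketAddr;

struct Rdma; // stand-in for the crate's connection type

/// # Safety
/// The caller asserts that the remote process is trusted and correct.
/// A misbehaving peer can corrupt local memory through RDMA writes,
/// which is beyond what the local process can prevent or detect.
pub unsafe fn connect_trusted(_addr: SocketAddr) -> io::Result<Rdma> {
    // establish the connection exactly as the safe path does today
    unimplemented!()
}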

Related:

support append for memory region

The async-rdma memory region needs to be used with lock_utilities::MappedRwLockReadGuard.

lmr.as_mut_slice().write(&[1_u8; 8])

I want to append some bytes to the mr (this can be done by wrapping with std::io::Cursor) to avoid a useless copy.

But we can't build a cursor with a MappedRwLockReadGuard, like:

    let mut cursor = std::io::Cursor::new(lmr.as_mut_slice());
    cursor.write(&[1_u8; 8])?;

So I think we can consider providing a MappedRwLockWriteGuard<Cursor<&mut [u8]>>.

If needed, I can open a PR to add it.
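
As a plain-slice illustration of the requested append semantics (no async-rdma types involved): a Cursor tracks the write position, so consecutive writes append rather than overwrite:

use std::io::{Cursor, Write};

fn main() -> std::io::Result<()> {
    let mut buf = [0u8; 16];
    let mut cursor = Cursor::new(&mut buf[..]);
    cursor.write_all(&[1u8; 8])?; // first write starts at offset 0
    cursor.write_all(&[2u8; 8])?; // second write lands at offset 8
    assert_eq!(&buf[..8], &[1u8; 8]);
    assert_eq!(&buf[8..], &[2u8; 8]);
    Ok(())
}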

insufficient contiguous memory was available to service the allocation request

Hi there,

I'm trying to allocate an obscene amount of memory (2 Gbyte) but my program crashes with

insufficient contiguous memory was available to service the allocation request

this is the relevant code snippet:

let layout = Layout::from_size_align(request.message_size as usize, 1).unwrap();
let mut lmr = match rdma.alloc_local_mr(layout) {
    Ok(lmr) => lmr,
    Err(e) => {
        panic!("alloc_local_mr error: {}", e);
    }
};

request.message_size is set to 2 Gbyte.
My machine has plenty of RAM (256 GB) and I freshly rebooted, hoping to have more contiguous memory available. I also enabled CMA (Contiguous Memory Allocator) and reserved 50 GB, but that didn't help.
I also tried to concurrently allocate multiple 1 GB memory areas, which led to the same effect.
I don't believe that my machine has less than 2 GB of contiguous memory right after a reboot.
Any ideas?

Thanks

bump rust toolchain

For #111, the better solution is to bump the rust toolchain to a higher version.
This needs some changes to satisfy clippy's checks.

Claiming the name on crates.io?

There have been some "bad actors" who falsely claim the names of successful projects on crates.io. It seems like datenlord hasn't published any work on crates.io yet, which leaves room for these "bad actors". Would you consider publishing this crate, or even just claiming the name first?

[BUG] mr_allocator related bug when dealing two big lmr with one rdma object

The following code should reproduce the memory problem.

The case is that the server uses two lmrs to read data.

The first read is successful. When the first lmr is dropped, jemalloc dalloc is triggered (if the lmr is fairly large), and the EXTENT_TOKEN_MAP will remove the raw_mr item.

However, when creating the second lmr, the lookup_raw_mr function in mr_allocator.rs returns an error. There are actually three error situations in this function and I have seen all of them (still wondering why...).

Some things about my system setup:

  • I tried using sudo to run this, still failed
  • I have set my user ulimit to unlimited
  • I use softiwarp on ubuntu 20.04, but I think the bug is only related to mr_allocator

use async_rdma::{LocalMrWriteAccess, RdmaBuilder};
use portpicker::pick_unused_port;
use std::{
    alloc::Layout,
    io::{self, Write},
    net::{Ipv4Addr, SocketAddrV4},
    time::Duration,
};

const SIZE: usize = 44444444;

async fn client(addr: SocketAddrV4) -> io::Result<()> {
    let rdma = RdmaBuilder::default().connect(addr).await?;
    let data = vec![0u8; SIZE];

    // first send
    let layout = Layout::from_size_align(SIZE, 1).unwrap();
    let mut lmr = rdma.alloc_local_mr(layout)?;
    lmr.as_mut_slice().write(&data)?;
    rdma.send_local_mr(lmr).await?;

    // second send
    let layout = Layout::from_size_align(SIZE, 1).unwrap();
    let mut lmr = rdma.alloc_local_mr(layout)?;
    lmr.as_mut_slice().write(&data)?;
    rdma.send_local_mr(lmr).await?;

    // wait for server to read, otherwise this client will early exit
    tokio::time::sleep(Duration::from_secs(5)).await;

    Ok(())
}

#[tokio::main]
async fn server(addr: SocketAddrV4) -> io::Result<()> {
    let rdma = RdmaBuilder::default().listen(addr).await?;

    {
        let layout = Layout::from_size_align(SIZE, 1).unwrap();
        println!("layout: {:?}", layout);
        let mut lmr = rdma.alloc_local_mr(layout)?;
        println!("lmr: {:?}", lmr);
        let rmr = rdma.receive_remote_mr().await?;
        rdma.read(&mut lmr, &rmr).await?;
        println!("rdma read\n-------------");
    }

    // lmr will drop here

    {
        let layout = Layout::from_size_align(SIZE, 1).unwrap();
        println!("layout: {:?}", layout);
        // the memory bug occurs here
        let mut lmr = rdma.alloc_local_mr(layout)?;
        println!("lmr: {:?}", lmr);
        let rmr = rdma.receive_remote_mr().await?;
        rdma.read(&mut lmr, &rmr).await?;
        println!("rdma read\n-------------");
    }

    Ok(())
}

#[tokio::main]
async fn main() {
    let addr = SocketAddrV4::new(Ipv4Addr::new(127, 0, 0, 1), pick_unused_port().unwrap());
    std::thread::spawn(move || server(addr));
    tokio::time::sleep(Duration::from_secs(1)).await;
    client(addr)
        .await
        .map_err(|err| println!("{}", err))
        .unwrap();
}

JEMALLOC_RETAIN defined cause memory allocate failed

When I tried to run the client and server examples, I got the following message:
thread 'tokio-runtime-worker' panicked at 'internal error: entered unreachable code: dalloc failed, check if JEMALLOC_RETAIN defined. If so, then please undefined it', src/mr_allocator.rs:485:13

I found that JEMALLOC_RETAIN is defined in a C file inside tikv-jemalloc-sys, but there is no feature associated with it, so how can I undefine it?

Crate examples won't compile.

It seems like something happened to one of the deps, possibly recently?

cargo build --example rpc
Updating crates.io index
error: failed to select a version for the requirement simd-abstraction = "^0.5.0"
candidate versions found which didn't match: 0.7.1
location searched: crates.io index
required by package hex-simd v0.5.0
... which satisfies dependency hex-simd = "^0.5.0" of package async-rdma v0.5.0 (/home/mike/dev/rust/async-rdma)
perhaps a crate was updated and forgotten to be re-vendored?

It looks like all prior versions to simd-abstraction 0.7.1 have been yanked from crates.io...?

Failed to run example on ubuntu server 20.04

I tried to run the example as shown on the repo index page:

cargo run --example rpc
    Finished dev [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/examples/rpc`
rpc server started
2023-08-21T15:50:53.641480Z ERROR async_rdma::error_utilities: OS error Os { code: 22, kind: InvalidInput, message: "Invalid argument" }
2023-08-21T15:50:53.641514Z ERROR async_rdma::error_utilities: OS error Os { code: 22, kind: InvalidInput, message: "Invalid argument" }
Invalid argument (os error 22)
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ()', examples/rpc.rs:147:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 22, kind: InvalidInput, message: "Invalid argument" }', examples/rpc.rs:32:71

And I also ran this example on another ubuntu 22.04 machine, and it succeeded!

[Feature] How to register Memory Region on specific memory address?

The API to create an MR, i.e. the alloc_local_mr(layout) function, only has the option to allocate new memory in an ad-hoc way (using malloc inside the alloc_local_mr function).

Is it possible to register an existing memory block as an MR? I want to transfer a large block of memory and don't want it to be copied, which may cause performance issues. Many thanks in advance!

Support working with existing apps

The current implementation defines its own protocol to establish connections and send/receive messages. This brings a limitation: both client and server must use this library to communicate. We cannot use this library to build a client for an existing RDMA server. So I have 2 suggestions that might help:

  1. Support setting up connections through RDMA CM.
  2. Add lower-level APIs to send/receive raw messages.

How to forget memory with the jemalloc strategy.

Hello, I was using PyO3 to bind this Rust code to Python and am now struggling with a memory ownership problem (more specifically, a double free).

When I pass an lmr to Python, I do std::mem::forget(lmr) to make Rust forget about the memory, as the memory has already been transferred to Python by some unsafe pointer operations.

However, according to my reasoning, the jemalloc strategy does not give up this memory even if lmr.drop() is not called. Is this expected? Is there a way to forget this memory with the jemalloc strategy?

P.S. The raw strategy works fine and no double free occurs.
