
async-rdma's Introduction

DatenLord



DatenLord is a next-generation cloud-native distributed storage platform that aims to meet the performance-critical storage needs of next-generation cloud-native applications, such as microservices, serverless, and AI. On one hand, DatenLord is designed to be a cloud-native storage system, which is itself distributed, fault-tolerant, and supports graceful upgrades. These cloud-native features make DatenLord easy to use and easy to maintain. On the other hand, DatenLord is designed as an application-oriented storage system, in that it is optimized for many performance-critical scenarios, such as databases, AI machine learning, and big data. Meanwhile, DatenLord provides a high-performance storage service for containers, which facilitates stateful applications running on top of Kubernetes (K8S). The high performance of DatenLord is achieved by leveraging the most recent technology revolutions in hardware and software, such as NVMe, non-volatile memory, asynchronous programming, and native Linux asynchronous IO support.


Why DatenLord?

Why do we build DatenLord? The reason is two-fold:

  • Firstly, the recent computer hardware architecture revolution calls for refactoring storage software. The storage-related functionality inside the Linux kernel hasn't changed much in the past 10 years, when the hard-disk drive (HDD) was the main storage device. Nowadays, the solid-state drive (SSD) has become mainstream, not to mention the most advanced SSDs, NVMe and non-volatile memory. The performance of SSD is hundreds of times better than that of HDD: HDD latency is around 1–10 ms, whereas SSD latency is around 50–150 μs, NVMe latency is around 25 μs, and non-volatile memory latency is 350 ns. With this performance revolution in storage devices, traditional blocking-style/synchronous IO in the Linux kernel becomes very inefficient, and non-blocking-style/asynchronous IO is much more applicable. The Linux kernel community has already realized this and recently introduced a native asynchronous IO mechanism, io_uring, to improve IO performance. Besides blocking-style/synchronous IO, the context-switch overhead in the Linux kernel is no longer negligible w.r.t. SSD latency. Many modern programming languages have adopted asynchronous programming, green threads, or coroutines to manage asynchronous IO tasks in user space, in order to avoid the context-switch overhead introduced by blocking IO. Therefore we think it's time to build a next-generation storage system that takes advantage of the storage performance revolution as much as possible, by leveraging non-blocking/asynchronous IO, asynchronous programming, NVMe, and even non-volatile memory.

  • Secondly, most distributed/cloud-native systems separate computing and storage: computing tasks/applications and storage systems run on dedicated clusters. This isolated architecture minimizes maintenance effort, since it decouples the maintenance tasks of computing clusters and storage clusters, such as upgrade, expansion, and migration of each cluster, which is much simpler than maintaining coupled clusters. Nowadays, however, applications deal with much larger datasets than ever before. One notorious example is an AI training job that takes one hour to load data whereas the training itself finishes in only 45 minutes. Isolating computing and storage therefore makes IO very inefficient, as transferring data between applications and storage systems over the network takes a lot of time. Further, with the isolated architecture, applications have to be aware of differing data locations and the varying access cost due to data location, network distance, etc. DatenLord tackles the IO performance issue of the isolated architecture in a novel way: it abstracts away the heterogeneous storage details and makes the differences in data location, access cost, etc., transparent to applications. With DatenLord, applications can assume all the data to be accessed is local, and DatenLord will access the data on behalf of the applications. Besides, DatenLord can help K8S schedule jobs close to cached data, since DatenLord knows the exact location of all cached data. By doing so, applications are greatly simplified w.r.t. data access, and DatenLord can leverage the local cache, neighbor caches, and remote caches to speed up data access and boost performance.


Target scenarios

The main scenario for DatenLord is to facilitate high availability across multi-cloud, hybrid-cloud, multiple data centers, etc. Concretely, many online business providers run businesses too important to afford any downtime. To achieve high availability, these service providers leverage multi-cloud, hybrid-cloud, and multiple data centers to avoid single points of failure in any single cloud or data center, by deploying applications and services across multiple clouds or data centers. It is relatively easy to deploy applications and services to multiple clouds and data centers, but it is much harder to duplicate all data to all clouds or all data centers in a timely manner, due to the huge data size. If data is not equally available across multiple clouds or data centers, the online business may still suffer from a single point of failure, because data becomes unavailable when the cloud or data center holding it fails.

DatenLord alleviates the data unavailability caused by cloud or data center failures by caching data in multiple layers, such as a local cache, neighbor caches, and remote caches. Although the total data size is huge, the hot data involved in online business is usually of limited size, a property known as data locality. DatenLord leverages data locality and builds a set of large-scale, distributed, automatic cache layers to buffer hot data in a smart manner. The benefit of DatenLord is two-fold:

  • DatenLord is transparent to applications, namely DatenLord does not need any modification to applications;
  • DatenLord is high performance: it automatically caches data according to its hotness, and its performance comes from applying different caching strategies to different target applications, e.g., a least recently used (LRU) strategy for some kinds of random access and a most recently used (MRU) strategy for some kinds of sequential access.

Architecture

Single Data Center

DatenLord Single Data Center

Multiple Data Centers and Hybrid Cloud

DatenLord Multiple Data Centers and Hybrid Cloud

DatenLord provides 3 kinds of user interfaces: a KV interface, an S3 interface, and a file interface. The backend storage is supported by the underlying distributed cache layer, which is strongly consistent. Strong consistency is guaranteed by the metadata management module, which is built on a high-performance consensus protocol. The persistent storage layer can be local disk or S3 storage. For the network, RDMA is used to provide high-throughput, low-latency networking; if RDMA is not supported, TCP is an alternative. For the multiple data centers and hybrid cloud scenario, there is a dedicated metadata server that serves metadata requests within the same data center, while in the single data center scenario, the metadata module can run on the same machine as the cache node. The network between data centers and public clouds is managed as a private network to guarantee high-quality data transfer.

Quick Start

Currently DatenLord is built as Docker images and can be deployed via K8S.

To deploy DatenLord via K8S, simply run:

  • sed -e 's/e2e_test/latest/g' scripts/setup/datenlord.yaml > datenlord-deploy.yaml
  • kubectl apply -f datenlord-deploy.yaml

To use DatenLord, just define a PVC using the DatenLord StorageClass, and then deploy a Pod using this PVC:

cat <<EOF >datenlord-demo.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-datenlord-test
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 100Mi
  storageClassName: csi-datenlord-sc

---
apiVersion: v1
kind: Pod
metadata:
  name: mysql-datenlord-test
spec:
  containers:
  - name: mysql
    image: mysql
    env:
    - name: MYSQL_ROOT_PASSWORD
      value: "rootpasswd"
    volumeMounts:
    - mountPath: /var/lib/mysql
      name: data
      subPath: mysql
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pvc-datenlord-test
EOF

kubectl apply -f datenlord-demo.yaml

DatenLord provides a customized scheduler that implements the K8S scheduler extender. The scheduler tries to schedule a pod to the node that holds the volume it requests. To use the scheduler, add schedulerName: datenlord-scheduler to the spec of your pod. Caveat: a dangling Docker image may cause a "failed to parse request" error; running docker image prune on each K8S node fixes it.

If you use the K8S CSI snapshot feature, you may need to install the snapshot CRD and controller on K8S:

  • kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
  • kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
  • kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
  • kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
  • kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml

Monitoring

The DatenLord monitoring guideline is in datenlord monitoring. We provide both YAML and Helm methods to deploy the monitoring system.

To use the YAML method, just run

sh ./scripts/setup/datenlord-monitor-deploy.sh

To use the Helm method, run

sh ./scripts/setup/datenlord-monitor-deploy.sh helm

Performance Test

The performance test is done with fio, and fio-plot is used to plot performance histograms.

To run the performance test:

sudo apt-get update
sudo apt-get install -y fio python3-pip
sudo pip3 install matplotlib numpy fio-plot
sh ./scripts/perf/fio-perf-test.sh TEST_DIR

Four histograms will be generated:

  • Random read IOPS and latency for different block sizes
  • Random write IOPS and latency for different block sizes
  • Random read IOPS and latency for different read thread numbers with 4k block size
  • Random write IOPS and latency for different write thread numbers with 4k block size

The performance test is added to GitHub Actions (cron.yml), and a performance report is generated and archived as artifacts (Example) every four hours.

How to Contribute

Anyone interested in DatenLord is welcome to contribute.

Coding Style

Please follow the code style. Meanwhile, DatenLord adopts very strict clippy linting; please fix every clippy warning before submitting your PR. Also please make sure all CI tests pass.

Continuous Integration (CI)

The CI of DatenLord leverages GitHub Actions. There are two CI flows for DatenLord: one for Rust cargo test, clippy lints, and standard filesystem E2E checks; the other for CSI-related tests, such as the CSI sanity test and the CSI E2E test.

The CSI E2E test setup is a bit complex and its action script cron.yml is quite long, so let's explain it in detail:

  • First, it sets up a test K8S cluster with one master node and three slave nodes, using Kubernetes in Docker (KinD);

  • Second, the CSI E2E test requires password-less SSH login to each K8S slave node, since it might run commands to prepare the test environment or verify test results, so it has to set up an SSH key in each Docker container of the KinD slave nodes;

  • Third, it builds DatenLord container images and loads them into KinD. This is a caveat of KinD: since KinD runs K8S nodes inside Docker containers, kubelet cannot reach any resource on the local host, so KinD provides a load operation to make container images from the local host visible to kubelet;

  • At last, it deploys DatenLord to the test K8S cluster, downloads the pre-built K8S E2E binary, runs it in parallel by invoking ginkgo -p, and selects only the External.Storage related CSI E2E test cases to run.

Sub-Projects

DatenLord has several related sub-projects, mostly work in progress, listed alphabetically:

Road Map

  • 0.1 Refactor async fuse lib to provide clear async APIs, which is used by the datenlord filesystem.
  • 0.2 Support all Fuse APIs in the datenlord fs.
  • 0.3 Make fuse lib fully asynchronous. Switch async fuse lib's device communication channel from blocking I/O to io_uring.
  • 0.4 Complete K8S integration test.
  • 0.5 Support RDMA.
  • 1.0 Complete Tensorflow K8S integration and finish performance comparison with raw fs.


async-rdma's Issues

contribution

How can I contribute to the project? Is there any Discord or Slack channel for discussion?

Potential unsoundness

There is no guarantee that the safety requirements are met here.

https://doc.rust-lang.org/std/slice/fn.from_raw_parts.html#safety

/// Get the memory region as slice
#[inline]
fn as_slice(&self) -> &[u8] {
    unsafe { slice::from_raw_parts(self.as_ptr(), self.length()) }
}

/// Get the memory region as mut slice
#[inline]
fn as_mut_slice(&mut self) -> &mut [u8] {
    unsafe { slice::from_raw_parts_mut(self.as_mut_ptr(), self.length()) }
}
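
For illustration, here is a minimal, hypothetical sketch (not the crate's actual code) of what it would look like to spell out the from_raw_parts requirements; the Mr struct and its fields are stand-ins for the real memory region internals:

use std::slice;

/// Hypothetical stand-in for a registered memory region.
struct Mr {
    ptr: *const u8,
    len: usize,
}

impl Mr {
    /// Sketch of `as_slice` with the `from_raw_parts` requirements made explicit.
    fn as_slice(&self) -> &[u8] {
        debug_assert!(!self.ptr.is_null());
        debug_assert!(self.len <= isize::MAX as usize);
        // SAFETY: `ptr` must point to `len` initialized bytes that are not
        // mutated (e.g. by the NIC or a remote peer) while the returned
        // borrow is alive; a concurrent remote write would make this unsound.
        unsafe { slice::from_raw_parts(self.ptr, self.len) }
    }
}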

Large array transfer problem

Env

Two Ubuntu20.04 LTS clients for Vmware

Description

The two clients can each perform RDMA large-array operations separately by binding 127.0.0.1. However, when one client performs an RDMA operation on a large array against the other, no packets are sent; I used Wireshark but did not capture any packets, whereas operating on a small array does produce packets.

By the way, for arrays that exceed the size of the ulimit -l parameter, I do not need to raise the locked-memory limit for RDMA on a single client. But when two clients perform RDMA operations, the large array requires raising the ulimit -l limit in order to connect.

The two clients' Rust code is as follows:

// server
use async_rdma::{LocalMrReadAccess, LocalMrWriteAccess, Rdma, RdmaListener};
use portpicker::pick_unused_port;
use std::{
    alloc::Layout,
    io,
    net::{Ipv4Addr, SocketAddrV4},
    time::Duration,
};

const LEN:usize = 30 * 1024 * 1024;

async fn server(addr: SocketAddrV4) -> io::Result<()> {
    let rdma_listener = RdmaListener::bind(addr).await?;
    let rdma = rdma_listener.accept(1, 1, 512).await?;
    // receive the immediate value written by the client
    let imm_v = rdma.receive_write_imm().await?;
    // print the immediate value
    print!("immediate value: {}", imm_v);
    // receive the metadata of the mr sent by client
    let lmr = rdma.receive_local_mr().await?;
    // print the content of lmr, which was `write` by client
    unsafe { println!("{:?}", *(lmr.as_ptr() as *const [i32;LEN])) };
    // wait for the agent thread to send all responses to the remote.
    tokio::time::sleep(Duration::from_secs(1)).await;
    Ok(())
}
#[tokio::main]
async fn main() {
    let addr = SocketAddrV4::new(Ipv4Addr::new(192, 168, 59, 131), 8081);
    server(addr).await.unwrap();
    tokio::time::sleep(Duration::from_secs(3)).await;
}


// client
use async_rdma::{LocalMrReadAccess, LocalMrWriteAccess, Rdma, RdmaListener};
use portpicker::pick_unused_port;
use std::{
    alloc::Layout,
    io,
    net::{Ipv4Addr, SocketAddrV4},
    time::Duration,
};


const LEN:usize = 30 * 1024 * 1024;

async fn client(addr: SocketAddrV4) -> io::Result<()> {
    let rdma = Rdma::connect(addr, 1, 1, 512).await?;
    let mut lmr = rdma.alloc_local_mr(Layout::new::<[i32;LEN]>())?;
    let mut rmr = rdma.request_remote_mr(Layout::new::<[i32;LEN]>()).await?;
    // load data into lmr
    unsafe { *(lmr.as_mut_ptr() as *mut [i32;LEN]) = [0;LEN] };
    // write the content of local mr into remote mr  with immediate value
    rdma.write_with_imm(&lmr, &mut rmr, 1).await?;
    // then send rmr's metadata to server to make server aware of it
    rdma.send_remote_mr(rmr).await?;
    Ok(())
}
#[tokio::main]
async fn main() {
    let addr = SocketAddrV4::new(Ipv4Addr::new(192, 168, 59, 131), 8081);
    client(addr)
        .await
        .map_err(|err| println!("{}", err))
        .unwrap();
    tokio::time::sleep(Duration::from_secs(3)).await;
}

Optimize the poll workflow of CQ

Unnecessary allocation: the single entry can be written into a stack slot (MaybeUninit<WorkCompletion>) instead of a Vec.

/// Poll one work completion from CQ
pub(crate) fn poll_single(&self) -> io::Result<WorkCompletion> {
    let polled = self.poll(1)?;
    polled
        .into_iter()
        .next()
        .ok_or_else(|| io::Error::new(io::ErrorKind::WouldBlock, ""))
}

Yes, poll is only used by poll_single for now, so the allocation of the vector is unnecessary.
poll_single is an interim scheme; we will optimize the poll workflow later and track it in a new issue.

Originally posted by @GTwhy in #49 (comment)

Currently, the event_listener only polls one CQE at a time; the poll workflow can be optimized to poll more at a time.
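
A minimal sketch of the suggested direction, assuming a hypothetical lower-level poll_into primitive that fills caller-provided storage (the Cq and WorkCompletion types here are stubs for illustration):

use std::io;
use std::mem::MaybeUninit;

struct WorkCompletion;
struct Cq;

impl Cq {
    /// Hypothetical primitive: fill up to `buf.len()` entries, return how many were written.
    fn poll_into(&self, _buf: &mut [MaybeUninit<WorkCompletion>]) -> io::Result<usize> {
        Ok(0) // stub
    }

    /// Poll one work completion without allocating a `Vec`.
    fn poll_single(&self) -> io::Result<WorkCompletion> {
        let mut slot = MaybeUninit::uninit();
        if self.poll_into(std::slice::from_mut(&mut slot))? == 1 {
            // SAFETY: `poll_into` reported that it initialized this entry.
            Ok(unsafe { slot.assume_init() })
        } else {
            Err(io::Error::new(io::ErrorKind::WouldBlock, ""))
        }
    }
}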

Unnecessary overflow check

left.overflow_shl(32) never panics because 0 <= time < 1e6. It can be written as left.wrapping_shl(32).

async-rdma/src/id.rs

Lines 5 to 20 in 2a55f32

/// Create a random u64 id.
///
/// Both `WorkRequestId` and `AgentRequestId` depend on this, so make this fn independent.
/// To avoid id duplication, this fn concatenates `SystemTime` and a random number into a u64.
/// The syscall may have some overhead, which can be improved later by balancing the pros and cons.
pub(crate) fn random_u64() -> u64 {
    let start = SystemTime::now();
    // No time can be earlier than Unix Epoch
    #[allow(clippy::unwrap_used)]
    let since_the_epoch = start.duration_since(UNIX_EPOCH).unwrap();
    let time = since_the_epoch.subsec_micros();
    let rand = rand::thread_rng().gen::<u32>();
    let left: u64 = time.into();
    let right: u64 = rand.into();
    left.overflow_shl(32) | right
}
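
A sketch of the suggested rewrite: subsec_micros() is always below 1_000_000, so after widening to u64 the shift by 32 cannot overflow, and wrapping_shl makes the absence of a runtime check explicit:

use rand::Rng;
use std::time::{SystemTime, UNIX_EPOCH};

pub fn random_u64() -> u64 {
    // No time can be earlier than Unix Epoch
    let since_the_epoch = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system time before Unix Epoch");
    // subsec_micros() < 1_000_000 < 2^32, so `time << 32` always fits in a u64.
    let time: u64 = since_the_epoch.subsec_micros().into();
    let rand: u64 = rand::thread_rng().gen::<u32>().into();
    time.wrapping_shl(32) | rand
}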

self.addr().overflow_add(i.start) and i.end.overflow_sub(i.start) never panic because the bounds have been checked.

(Should "wrong range of lmr" be an IO error?)

/// Get a local mr slice
#[inline]
pub fn get(&self, i: Range<usize>) -> io::Result<LocalMrSlice> {
    // SAFETY: `self` is checked to be valid and in bounds above.
    if i.start >= i.end || i.end > self.length() {
        Err(io::Error::new(io::ErrorKind::Other, "wrong range of lmr"))
    } else {
        Ok(LocalMrSlice::new(
            self,
            self.addr().overflow_add(i.start),
            i.len(),
        ))
    }
}

/// Get a mutable local mr slice
#[inline]
pub fn get_mut(&mut self, i: Range<usize>) -> io::Result<LocalMrSliceMut> {
    // SAFETY: `self` is checked to be valid and in bounds above.
    if i.start >= i.end || i.end > self.length() {
        Err(io::Error::new(io::ErrorKind::Other, "wrong range of lmr"))
    } else {
        Ok(LocalMrSliceMut::new(
            self,
            self.addr().overflow_add(i.start),
            i.len(),
        ))
    }
}

/// Get a remote mr slice
#[inline]
pub fn get(&self, i: Range<usize>) -> io::Result<RemoteMrSlice> {
    // SAFETY: `self` is checked to be valid and in bounds above.
    if i.start >= i.end || i.end > self.length() {
        Err(io::Error::new(io::ErrorKind::Other, "wrong range of rmr"))
    } else {
        let slice_token = MrToken {
            addr: self.addr().overflow_add(i.start),
            len: i.end.overflow_sub(i.start),
            rkey: self.rkey(),
        };
        Ok(RemoteMrSlice::new_from_token(self, slice_token))
    }
}

/// Get a mutable remote mr slice
#[inline]
pub fn get_mut(&mut self, i: Range<usize>) -> io::Result<RemoteMrSliceMut> {
    // SAFETY: `self` is checked to be valid and in bounds above.
    if i.start >= i.end || i.end > self.length() {
        Err(io::Error::new(io::ErrorKind::Other, "wrong range of rmr"))
    } else {
        let slice_token = MrToken {
            addr: self.addr().overflow_add(i.start),
            len: i.end.overflow_sub(i.start),
            rkey: self.rkey(),
        };
        Ok(RemoteMrSliceMut::new_from_token(self, slice_token))
    }
}

start.overflow_add(self.max_sr_data_len) can be written as start.saturating_add(self.max_sr_data_len).

end.overflow_sub(start) never panics because start < end.

async-rdma/src/agent.rs

Lines 175 to 178 in 2441414

let end = (start.overflow_add(self.max_sr_data_len)).min(lm_len);
let kind = RequestKind::SendData(SendDataRequest {
    len: end.overflow_sub(start),
});
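
A small sketch combining both suggestions, with free-standing parameters standing in for the agent's fields:

/// Compute the length of the next chunk to send, assuming the caller
/// guarantees `start <= lm_len`.
fn chunk_len(start: usize, max_sr_data_len: usize, lm_len: usize) -> usize {
    // `saturating_add` clamps instead of panicking on overflow, and the
    // subtraction cannot underflow because `end >= start` holds here.
    let end = start.saturating_add(max_sr_data_len).min(lm_len);
    end - start
}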

Cancellation safety

Disclaimer: I knew nothing about the details of RDMA before this.

I have some questions about the example at first glance.

/// The server can't be aware of RDMA `read` or `write`, so we need to sync with the client
/// before the client `read`s and after it `write`s.
async fn sync_with_client(rdma: &Rdma) {
    let mut lmr_sync = rdma
        .alloc_local_mr(Layout::new::<Request>())
        .map_err(|err| println!("{}", &err))
        .unwrap();
    // write data to lmr
    unsafe { *(lmr_sync.as_mut_ptr() as *mut Request) = Request::Sync };
    rdma.send(&lmr_sync)
        .await
        .map_err(|err| println!("{}", &err))
        .unwrap();
}

  1. When will the local memory region be deallocated?
  2. Does the unsafe statement have any safety requirements?
  3. What if the future returned by sync_with_client is dropped (cancelled) before the request completes?
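
To make question 3 concrete, here is a hedged illustration (reusing the Rdma handle and sync_with_client from the example above) of how a caller could drop the send future mid-flight, e.g. under a timeout:

use std::time::Duration;

// If the timeout fires first, the future returned by `sync_with_client` is
// dropped before `rdma.send` completes. What happens to the posted work
// request and the lmr it references at that point?
async fn sync_with_timeout(rdma: &Rdma) {
    if tokio::time::timeout(Duration::from_millis(10), sync_with_client(rdma))
        .await
        .is_err()
    {
        // The send future was cancelled here, mid-operation.
    }
}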

Related article: https://zhuanlan.zhihu.com/p/346219893

Connection timeout between servers

I'm new to RDMA in general, so I'm just trying to explore how this works. I tried cargo run --example rpc on my localhost and it runs fine. However, when I try to run this between 2 servers, I get a timeout error once the client connects. I'm using cargo run --example client and cargo run --example server. Any troubleshooting tips?

Encounter a doubt when using MR in async-rdma

While studying this library, I encountered the following result, which confused me.
request structure: (screenshot)
client side: (screenshot)
server side: (screenshot)
main: (screenshot)
result:

rpc server started
header size: 28
size: 80, 
Msg { header: RequestHeader { id: 0, type: 0, flags: 0, total_length: 0, file_path_length: 0, meta_data_length: 0, data_length: 0 }, meta_data: [], data: [] }
server side size: 80, 
Msg { header: RequestHeader { id: 0, type: 0, flags: 0, total_length: 6, file_path_length: 0, meta_data_length: 3, data_length: 3 }, meta_data: [1, 2, 3, 4, 5], data: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] }
response: Msg { header: ResponseHeader { id: 0, status: 0, flags: 0, total_length: 0, meta_data_length: 0, data_length: 0 }, meta_data: [1, 2, 3, 4, 5], data: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4] }

Check more hardware limitations

    Shall we also check other hardware limitations?

Originally posted by @rogercloud in #89 (comment)

We may get errors returned by ibv APIs with ambiguous error messages because of hardware limitations.

We can check more hardware limitations, capture more errors and add some debug information.

confusion about support for Infiniband and mlx5?

(First I have to declare that I am really a novice to RDMA ...)

This library works fine when I play with RDMA based on SoftiWARP. However, now that I am trying to move all my code and tools to a platform equipped with an MLX5 card, I can no longer connect my rdma server and client (by specifying ip:port).

Sorry if this question is stupid, but I am not quite clear about the relationships and differences between these concepts: RoCE, RXE, iWARP, MLX, Infiniband (IB) ...

Does this repo only support RoCE but not Infiniband(IB)?

run example server panic

I ran the server example with the commands below. The value imm is None, so the unwrap panics.

/data/async-rdma$ RUST_BACKTRACE=1 cargo run --example server localhost 9527
    Finished dev [unoptimized + debuginfo] target(s) in 0.08s
     Running `target/debug/examples/server localhost 9527`
server start
accepted
[1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1], None
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', examples/server.rs:32:27
stack backtrace:
   0: rust_begin_unwind
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/panicking.rs:114:5
   3: core::option::Option<T>::unwrap
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/option.rs:823:21
   4: server::receive_data_with_imm_from_client::{{closure}}
             at ./examples/server.rs:32:23
   5: server::main::{{closure}}
             at ./examples/server.rs:103:45
   6: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/park.rs:283:63
   7: tokio::runtime::coop::with_budget
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/coop.rs:107:5
   8: tokio::runtime::coop::budget
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/coop.rs:73:5
   9: tokio::runtime::park::CachedParkThread::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/park.rs:283:31
  10: tokio::runtime::context::BlockingRegionGuard::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/context.rs:315:13
  11: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/scheduler/multi_thread/mod.rs:66:9
  12: tokio::runtime::runtime::Runtime::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/runtime.rs:304:45
  13: server::main


/data/async-rdma$ RUST_BACKTRACE=1 cargo run --example client localhost 9527
    Finished dev [unoptimized + debuginfo] target(s) in 0.08s
     Running `target/debug/examples/client localhost 9527`
client start
connected
[0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 2, 2, 2, 2]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: TimedOut, error: "Timeout for waiting for a response." }', examples/client.rs:134:37
stack backtrace:
   0: rust_begin_unwind
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/panicking.rs:64:14
   2: core::result::unwrap_failed
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/result.rs:1790:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/result.rs:1112:23
   4: client::main::{{closure}}
             at ./examples/client.rs:134:5
   5: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/park.rs:283:63
   6: tokio::runtime::coop::with_budget
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/coop.rs:107:5
   7: tokio::runtime::coop::budget
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/coop.rs:73:5
   8: tokio::runtime::park::CachedParkThread::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/park.rs:283:31
   9: tokio::runtime::context::BlockingRegionGuard::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/context.rs:315:13
  10: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/scheduler/multi_thread/mod.rs:66:9
  11: tokio::runtime::runtime::Runtime::block_on
             at /home/zhaoxu/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/tokio-1.28.1/src/runtime/runtime.rs:304:45
  12: client::main
             at ./examples/client.rs:141:5
  13: core::ops::function::FnOnce::call_once
             at /rustc/2c8cc343237b8f7d5a3c3703e3a87f2eb2c54a74/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

RDMA Soundness Scope

It is impossible to prevent an incorrect remote process from triggering UB in the local process.
Like mmap and /proc/self/mem, such a situation is beyond the control of the Rust language.

There are two solutions:

  • Document the behavior and remove it from soundness concerns.
    Like rust-lang/rust#97837
  • Put an unsafe function on the way from network connections to active RDMA connections.
    The function means "trust the remote process", although it is impossible to actually check whether the remote process is correct (see the sketch below).

A timeout from a single side is still unsound, because UB may happen when the system time goes backwards.
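
A sketch of the second option, with a hypothetical name and signature; the point is only that the trust decision becomes an explicit unsafe call:

use std::io;
use std::net::SocketAddr;

struct Rdma; // stand-in for the crate's connection type

/// # Safety
/// The caller asserts that the remote process is trusted and correct.
/// A misbehaving peer can corrupt local memory through RDMA writes,
/// which is beyond what the local process can prevent or detect.
pub unsafe fn connect_trusted(_addr: SocketAddr) -> io::Result<Rdma> {
    // establish the connection exactly as the safe path does today
    unimplemented!()
}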

Related:

support append for memory region

The async-rdma memory region needs to be used with lock_utilities::MappedRwLockReadGuard.

lmr.as_mut_slice().write(&[1_u8; 8])

I want to append some bytes to the mr (this can be done by wrapping with std::io::Cursor) to avoid a useless copy.

But we can't build a cursor with a MappedRwLockReadGuard, like:

    let mut cursor = std::io::Cursor::new(lmr.as_mut_slice());
    cursor.write(&[1_u8; 8])?;

So I think we can consider providing a MappedRwLockWriteGuard<Cursor<&mut [u8]>>.

If needed, I can open a PR to add it.
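
As a plain-slice illustration of the requested append semantics (no async-rdma types involved): a Cursor tracks the write position, so consecutive writes append rather than overwrite:

use std::io::{Cursor, Write};

fn main() -> std::io::Result<()> {
    let mut buf = [0u8; 16];
    let mut cursor = Cursor::new(&mut buf[..]);
    cursor.write_all(&[1u8; 8])?; // first write starts at offset 0
    cursor.write_all(&[2u8; 8])?; // second write lands at offset 8
    assert_eq!(&buf[..8], &[1u8; 8]);
    assert_eq!(&buf[8..], &[2u8; 8]);
    Ok(())
}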

insufficient contiguous memory was available to service the allocation request

Hi there,

I'm trying to allocate an obscene amount of memory (2 Gbyte) but my program crashes with

insufficient contiguous memory was available to service the allocation request

this is the relevant code snippet:

let layout = Layout::from_size_align(request.message_size as usize, 1).unwrap();
let mut lmr = match rdma.alloc_local_mr(layout) {
    Ok(lmr) => lmr,
    Err(e) => {
        panic!("alloc_local_mr error: {}", e);
    }
};

request.message_size is set to 2 Gbyte.
My machine has plenty of RAM (256 GB) and I freshly rebooted, hoping to have more contiguous memory available. I also enabled CMA (Contiguous Memory Allocator) and reserved 50 GB, but that didn't help.
I also tried to concurrently allocate multiple 1 GB memory areas, which led to the same effect.
I don't believe that my machine has less than 2 GB of contiguous memory right after a reboot.
Any ideas?

Thanks

bump rust toolchain

For #111, the better solution is to bump the rust toolchain to a higher version.
This needs some changes to satisfy clippy's checks.

Claiming the name on crates.io?

There have been some "bad actors" who falsely claim the names of successful projects on crates.io. It seems like datenlord hasn't published any work on crates.io yet, which leaves room for these "bad actors". Would you consider publishing this crate, or even just claiming the name first?

[BUG] mr_allocator related bug when dealing two big lmr with one rdma object

The following code should reproduce the memory problem.

The case is that the server uses two lmrs to read data.

The first read is successful. When the first lmr is dropped, jemalloc dalloc is triggered (if the lmr is fairly large), and the EXTENT_TOKEN_MAP will remove the raw_mr item.

However, when creating the second lmr, the lookup_raw_mr function in mr_allocator.rs returns an error. There are actually three error situations in this function and I have seen all of them (still wondering why...).

Some things about my system setup:

  • I tried using sudo to run this, still failed
  • I have set my user ulimit to unlimited
  • I use softiwarp on ubuntu 20.04, but I think the bug is only related to mr_allocator

use async_rdma::{LocalMrWriteAccess, RdmaBuilder};
use portpicker::pick_unused_port;
use std::{
    alloc::Layout,
    io::{self, Write},
    net::{Ipv4Addr, SocketAddrV4},
    time::Duration,
};

const SIZE: usize = 44444444;

async fn client(addr: SocketAddrV4) -> io::Result<()> {
    let rdma = RdmaBuilder::default().connect(addr).await?;
    let data = vec![0u8; SIZE];

    // first send
    let layout = Layout::from_size_align(SIZE, 1).unwrap();
    let mut lmr = rdma.alloc_local_mr(layout)?;
    lmr.as_mut_slice().write(&data)?;
    rdma.send_local_mr(lmr).await?;

    // second send
    let layout = Layout::from_size_align(SIZE, 1).unwrap();
    let mut lmr = rdma.alloc_local_mr(layout)?;
    lmr.as_mut_slice().write(&data)?;
    rdma.send_local_mr(lmr).await?;

    // wait for server to read, otherwise this client will early exit
    tokio::time::sleep(Duration::from_secs(5)).await;

    Ok(())
}

#[tokio::main]
async fn server(addr: SocketAddrV4) -> io::Result<()> {
    let rdma = RdmaBuilder::default().listen(addr).await?;

    {
        let layout = Layout::from_size_align(SIZE, 1).unwrap();
        println!("layout: {:?}", layout);
        let mut lmr = rdma.alloc_local_mr(layout)?;
        println!("lmr: {:?}", lmr);
        let rmr = rdma.receive_remote_mr().await?;
        rdma.read(&mut lmr, &rmr).await?;
        println!("rdma read\n-------------");
    }

    // lmr will drop here

    {
        let layout = Layout::from_size_align(SIZE, 1).unwrap();
        println!("layout: {:?}", layout);
        // the memory bug occurs here
        let mut lmr = rdma.alloc_local_mr(layout)?;
        println!("lmr: {:?}", lmr);
        let rmr = rdma.receive_remote_mr().await?;
        rdma.read(&mut lmr, &rmr).await?;
        println!("rdma read\n-------------");
    }

    Ok(())
}

#[tokio::main]
async fn main() {
    let addr = SocketAddrV4::new(Ipv4Addr::new(127, 0, 0, 1), pick_unused_port().unwrap());
    std::thread::spawn(move || server(addr));
    tokio::time::sleep(Duration::from_secs(1)).await;
    client(addr)
        .await
        .map_err(|err| println!("{}", err))
        .unwrap();
}

JEMALLOC_RETAIN defined cause memory allocate failed

When I tried to run the client and server examples, I got the following message:
thread 'tokio-runtime-worker' panicked at 'internal error: entered unreachable code: dalloc failed, check if JEMALLOC_RETAIN defined. If so, then please undefined it', src/mr_allocator.rs:485:13

I found that JEMALLOC_RETAIN is defined in a C file inside tikv-jemalloc-sys, but there is no feature associated with it, so how can I undefine it?

Crate examples won't compile.

It seems like something happened to one of the deps, possibly recently?

cargo build --example rpc
Updating crates.io index
error: failed to select a version for the requirement simd-abstraction = "^0.5.0"
candidate versions found which didn't match: 0.7.1
location searched: crates.io index
required by package hex-simd v0.5.0
... which satisfies dependency hex-simd = "^0.5.0" of package async-rdma v0.5.0 (/home/mike/dev/rust/async-rdma)
perhaps a crate was updated and forgotten to be re-vendored?

It looks like all prior versions to simd-abstraction 0.7.1 have been yanked from crates.io...?

Failed to run example on ubuntu server 20.04

I tried to run the example as shown on the repo index page:

cargo run --example rpc
    Finished dev [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/examples/rpc`
rpc server started
2023-08-21T15:50:53.641480Z ERROR async_rdma::error_utilities: OS error Os { code: 22, kind: InvalidInput, message: "Invalid argument" }
2023-08-21T15:50:53.641514Z ERROR async_rdma::error_utilities: OS error Os { code: 22, kind: InvalidInput, message: "Invalid argument" }
Invalid argument (os error 22)
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ()', examples/rpc.rs:147:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 22, kind: InvalidInput, message: "Invalid argument" }', examples/rpc.rs:32:71

And I also ran this example on another ubuntu 22.04 machine, and it succeeded!

[Feature] How to register Memory Region on specific memory address?

The API to create an MR, i.e. the alloc_local_mr(layout) function, only has the option to allocate new memory in an ad-hoc way (using malloc inside the alloc_local_mr function).

Is it possible to register an existing memory block as an MR? I want to transfer a large block of memory and don't want it to be copied, which may cause performance issues. Many thanks in advance!

Support working with existing apps

The current implementation defines its own protocol to establish connections and send/receive messages. This brings a limitation: both client and server must use this library to communicate. We cannot use this library to build a client for an existing RDMA server. So I have 2 suggestions that might help:

  1. Support setting up connections through RDMA CM.
  2. Add lower-level APIs to send/receive raw messages.

How to forget memory with the jemalloc strategy.

Hello, I was using PyO3 to bind this Rust code to Python and am now struggling with a memory ownership problem (more specifically, a double free).

When I pass an lmr to Python, I do std::mem::forget(lmr) to make Rust forget about the memory, as the memory has already been transferred to Python by some unsafe pointer operations.

However, according to my reasoning, the jemalloc strategy does not give up this memory even if lmr.drop() is not called. Is this expected? Is there a way to forget this memory with the jemalloc strategy?

P.S. The raw strategy works fine and no double free occurs.
