v6d-io / v6d Goto Github PK

vineyard (v6d): an in-memory immutable data manager. (Project under CNCF, TAG-Storage)

License: Apache License 2.0

CMake 2.47% C++ 51.68% Python 17.31% Shell 0.57% C 0.09% Smarty 0.05% Makefile 0.58% Dockerfile 0.09% Go 12.33% Rust 4.25% Java 9.96% Cuda 0.05% Scala 0.53% HiveQL 0.06%

distributed-comp distributed-systems in-memory-storage big-data-analytics graph-analytics shared-memory distributed cloud-native cncf sig-storage

v6d's Introduction

an in-memory immutable data manager

Vineyard (v6d) is an innovative in-memory immutable data manager that offers out-of-the-box high-level abstractions and zero-copy in-memory sharing for distributed data in various big data tasks, such as graph analytics (e.g., GraphScope), numerical computing (e.g., Mars), and machine learning.

Vineyard is a CNCF sandbox project, made successful by its community.

Overview
Features of vineyard
Getting started with Vineyard
Deploying on Kubernetes
Frequently asked questions
Getting involved in our community
Third-party dependencies

What is vineyard

Vineyard is specifically designed to facilitate zero-copy data sharing among big data systems. To illustrate this, let's consider a typical machine learning task of time series prediction with LSTM. This task can be broken down into several steps:

First, we read the data from the file system as a pandas.DataFrame.
Next, we apply various preprocessing tasks, such as eliminating null values, to the dataframe.
Once the data is preprocessed, we define the model and train it on the processed dataframe using PyTorch.
Finally, we evaluate the performance of the model.

In a single-machine environment, pandas and PyTorch, despite being two distinct systems designed for different tasks, can efficiently share data with minimal overhead. This is achieved through an end-to-end process within a single Python script.

What if the input data is too large to be processed on a single machine?

As depicted on the left side of the figure, a common approach is to store the data as tables in a distributed file system (e.g., HDFS) and replace pandas with ETL processes using SQL over a big data system such as Hive and Spark. To share the data with PyTorch, the intermediate results are typically saved back as tables on HDFS. However, this can introduce challenges for developers.

For the same task, users must program for multiple systems (SQL & Python).
Data can be polymorphic. Non-relational data, such as tensors, dataframes, and graphs/networks (in GraphScope) are becoming increasingly common. Tables and SQL may not be the most efficient way to store, exchange, or process them. Transforming the data from/to "tables" between different systems can result in significant overhead.
Saving/loading the data to/from external storage incurs substantial memory-copies and IO costs.

Vineyard addresses these issues by providing:

In-memory distributed data sharing in a zero-copy fashion to avoid introducing additional I/O costs by leveraging a shared memory manager derived from plasma.
Built-in out-of-the-box high-level abstractions to share distributed data with complex structures (e.g., distributed graphs) with minimal extra development cost, while eliminating transformation costs.

As depicted on the right side of the above figure, we demonstrate how to integrate vineyard to address the task in a big data context.

First, we utilize Mars (a tensor-based unified framework for large-scale data computation that scales Numpy, Pandas, and Scikit-learn) to preprocess the raw data, similar to the single-machine solution, and store the preprocessed dataframe in vineyard.

single

import pandas as pd
data_csv = pd.read_csv('./data.csv', usecols=[1])

distributed

import mars.dataframe as md
dataset = md.read_csv('hdfs://server/data_full', usecols=[1])
# after preprocessing, save the dataset to vineyard
vineyard_distributed_tensor_id = dataset.to_vineyard()

Then, we modify the training phase to get the preprocessed data from vineyard. Here, vineyard makes sharing distributed data between Mars and PyTorch as simple as using a local variable in the single-machine solution.

single

data_X, data_Y = create_dataset(dataset)

distributed

import vineyard
client = vineyard.connect(vineyard_ipc_socket)
dataset = client.get(vineyard_distributed_tensor_id).local_partition()
data_X, data_Y = create_dataset(dataset)

Finally, we execute the training phase in a distributed manner across the cluster.

From this example, it is evident that with vineyard, the task in the big data context can be addressed with only minor adjustments to the single-machine solution. Compared to existing approaches, vineyard effectively eliminates I/O and transformation overheads.

Features

Efficient In-Memory Immutable Data Sharing

Vineyard serves as an in-memory immutable data manager, enabling efficient data sharing across different systems via shared memory without additional overheads. By eliminating serialization/deserialization and IO costs during data exchange between systems, Vineyard significantly improves performance.
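
As a rough illustration of the zero-copy idea only, the sketch below uses Python's standard multiprocessing.shared_memory module rather than vineyard's own API. The helper name roundtrip_through_shared_memory is hypothetical. A producer writes bytes into a named block once, and a consumer attaches to the same block by name and reads it through a memoryview.

```python
from multiprocessing import shared_memory


def roundtrip_through_shared_memory(payload: bytes) -> bytes:
    """Write a non-empty payload into a shared block, read it via a second handle."""
    # Producer side: allocate a named block and fill it once.
    producer = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        producer.buf[:len(payload)] = payload
        # Consumer side: attach by name; this maps the same block, no copy.
        consumer = shared_memory.SharedMemory(name=producer.name)
        try:
            view = consumer.buf[:len(payload)]  # memoryview over shared memory
            result = bytes(view)  # copied only so the value outlives the block
            view.release()
        finally:
            consumer.close()
    finally:
        producer.close()
        producer.unlink()
    return result
```

In a real deployment the producer and consumer would be separate processes; a single process is used here only to keep the sketch short.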

Out-of-the-Box High-Level Data Abstractions

Computation frameworks often have their own data abstractions for high-level concepts. For example, tensors can be represented as torch.tensor, tf.Tensor, mxnet.ndarray, etc. Moreover, every graph processing engine has its unique graph structure representation.

The diversity of data abstractions complicates data sharing. Vineyard addresses this issue by providing out-of-the-box high-level data abstractions over in-memory blobs, using hierarchical metadata to describe objects. Various computation systems can leverage these built-in high-level data abstractions to exchange data with other systems in a computation pipeline concisely and efficiently.
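
To make "hierarchical metadata over in-memory blobs" more concrete, here is a minimal sketch with plain dictionaries. The field names (typename, buffer, id) and the blob ids are assumptions made up for illustration; they are not vineyard's actual metadata schema.

```python
def collect_blob_ids(meta: dict) -> list:
    """Walk a nested metadata tree and return the blob ids it refers to."""
    blob_ids = []
    for value in meta.values():
        if isinstance(value, dict):
            if value.get("typename") == "blob":
                blob_ids.append(value["id"])
            else:
                blob_ids.extend(collect_blob_ids(value))
    return blob_ids


# Hypothetical metadata for a two-column dataframe.
dataframe_meta = {
    "typename": "dataframe",
    "columns": {
        "typename": "column_list",
        "a": {"typename": "tensor", "dtype": "int64",
              "buffer": {"typename": "blob", "id": "blob-1"}},
        "b": {"typename": "tensor", "dtype": "float64",
              "buffer": {"typename": "blob", "id": "blob-2"}},
    },
}
```

The point of the sketch is that a high-level object is only a small tree of metadata, while the payload stays in the blobs that the leaves point to.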

Stream Pipelining for Enhanced Performance

A computation doesn't need to wait for all preceding results to arrive before starting its work. Vineyard provides a stream as a special kind of immutable data for pipelining scenarios. The preceding job can write immutable data chunk by chunk to Vineyard while maintaining data structure semantics. The successor job reads shared-memory chunks from Vineyard's stream without extra copy costs and triggers its work. This overlapping reduces the overall processing time and memory consumption.
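
The overlap described above can be sketched with plain Python generators. This is not vineyard's stream API; produce_chunks and consume_stream are hypothetical names, and the log only records the order of events to show that consumption starts before production has finished.

```python
def produce_chunks(values, chunk_size, log):
    """Yield fixed-size chunks one at a time, as a stream producer would."""
    for start in range(0, len(values), chunk_size):
        chunk = tuple(values[start:start + chunk_size])  # immutable chunk
        log.append(("produced", chunk))
        yield chunk


def consume_stream(stream, log):
    """Process each chunk as soon as it arrives and return the running total."""
    total = 0
    for chunk in stream:
        log.append(("consumed", chunk))
        total += sum(chunk)
    return total


log = []
total = consume_stream(produce_chunks(list(range(6)), 2, log), log)
```

Only one chunk is in flight at a time, which is where the reduced memory consumption comes from.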

Versatile Drivers for Common Tasks

Many big data analytical tasks involve numerous boilerplate routines that are unrelated to the computation itself, such as various IO adapters, data partition strategies, and migration jobs. Since data structure abstractions usually differ between systems, these routines cannot be easily reused.

Vineyard provides common manipulation routines for immutable data as drivers. In addition to sharing high-level data abstractions, Vineyard extends the capability of data structures with drivers, enabling out-of-the-box reusable routines for the boilerplate parts in computation jobs.

Try Vineyard

Vineyard is available as a Python package and can be installed using pip:

pip3 install vineyard

For comprehensive and up-to-date documentation, please visit https://v6d.io.

If you wish to build vineyard from source, please consult the Installation guide. For instructions on building and running unit tests locally, refer to the Contributing section.

After installation, you can initiate a vineyard instance using the following command:

python3 -m vineyard

For further details on connecting to a locally deployed vineyard instance, please explore the Getting Started guide.

Deploying on Kubernetes

Vineyard is designed to efficiently share immutable data between different workloads, making it a natural fit for cloud-native computing. By embracing cloud-native big data processing and Kubernetes, Vineyard enables efficient distributed data sharing in cloud-native environments while leveraging the scaling and scheduling capabilities of Kubernetes.

To effectively manage all components of Vineyard within a Kubernetes cluster, we have developed the Vineyard Operator. For more information, please refer to the Vineyard Operator documentation.

FAQ

Vineyard shares many similarities with other open-source projects, yet it also has distinct features. We often receive the following questions about Vineyard:

Q: Can clients access the data while the stream is being filled?

Sharing one piece of data among multiple clients is a target scenario for Vineyard, as the data stored in Vineyard is immutable. Multiple clients can safely consume the same piece of data through memory sharing, without incurring extra costs or additional memory usage from copying data back and forth.

Q: How does Vineyard avoid serialization/deserialization between systems in different languages?

Vineyard provides high-level data abstractions (e.g., ndarrays, dataframes) that can be naturally shared between different processes, eliminating the need for serialization and deserialization between systems in different languages.
. . . . . .

For more detailed information, please refer to our FAQ page.

Get Involved

Join the CNCF Slack and participate in the #vineyard channel for discussions and collaboration.
Familiarize yourself with our contribution guide to understand the process of contributing to vineyard.
If you encounter any bugs or issues, please report them by submitting a GitHub issue or start a conversation in GitHub Discussions.
We welcome and appreciate your contributions! Submit them using pull requests.

Thank you in advance for your valuable contributions to vineyard!

Publications

Wenyuan Yu, Tao He, Lei Wang, Ke Meng, Ye Cao, Diwen Zhu, Sanhong Li, Jingren Zhou. Vineyard: Optimizing Data Sharing in Data-Intensive Analytics. ACM SIG Conference on Management of Data (SIGMOD), industry, 2023.

If you use this software, please cite our paper using the following metadata:

@article{yu2023vineyard,
   author = {Yu, Wenyuan and He, Tao and Wang, Lei and Meng, Ke and Cao, Ye and Zhu, Diwen and Li, Sanhong and Zhou, Jingren},
   title = {Vineyard: Optimizing Data Sharing in Data-Intensive Analytics},
   year = {2023},
   issue_date = {June 2023},
   publisher = {Association for Computing Machinery},
   address = {New York, NY, USA},
   volume = {1},
   number = {2},
   url = {https://doi.org/10.1145/3589780},
   doi = {10.1145/3589780},
   journal = {Proc. ACM Manag. Data},
   month = {jun},
   articleno = {200},
   numpages = {27},
   keywords = {data sharing, in-memory object store}
}

Acknowledgements

We thank the following excellent open-source projects:

apache-arrow, a cross-language development platform for in-memory analytics.
boost-leaf, a C++ lightweight error augmentation framework.
cityhash, CityHash, a family of hash functions for strings.
dlmalloc, Doug Lea's memory allocator.
etcd-cpp-apiv3, a C++ API for etcd's v3 client API.
flat_hash_map, an efficient hashmap implementation.
gulrak/filesystem, an implementation of C++17 std::filesystem.
libcuckoo, libcuckoo, a high-performance, concurrent hash table.
mimalloc, a general purpose allocator with excellent performance characteristics.
nlohmann/json, a JSON library for modern C++.
pybind11, a library for seamless operability between C++11 and Python.
s3fs, a library that provides a convenient Python filesystem interface for S3.
skywalking-infra-e2e, a next-generation end-to-end testing framework.
skywalking-swck, a Kubernetes operator for Apache SkyWalking.
wyhash, C++ wrapper around wyhash and wyrand.
BBHash, a fast, minimal-memory perfect hash function.
rax, an ANSI C radix tree implementation.
MurmurHash3, a fast non-cryptographic hash function.

License

Vineyard is distributed under Apache License 2.0. Please note that third-party libraries may not have the same license as vineyard.

v6d's People

Contributors

sighingnow wenyuanyu andydiwenzhu lidongze0629 pwrliang qinxuye luoxiaojian luckyplusten1024 xiaming9880 tomzhang yetingsky acezen yaobaiwei tanjiuwang lichnak jia-wang-11 zarca tankzhouqiang zeta1999 westfly btbujiangjun xhcom-ui b-xiang jacco-su wokaio hadryan goodoid neuyilan org-mars cxopensource siyuan0322 yecol rohan-cod ciusji chaitravi-ce chuanlei jian-he peilii sijie-l univerone linlih xieydd ldd91 zhanglei1949 beiluo97 nathanawmk andtheysay xterry moteesh-reddy sunkickr zhaodiankui thisiskicker fytzzh tempbottle superad mengke-mk sakkshm26 blecx mitu626 merlintang zjuytw waruto210 seanshi-prog tudb-labs liusitan shihao-thx bobhan1 yuping322 dzhiwei vegetableysm dashanji wuyueandrew cswater awesomedatatool husimplicity 2022h1030042g kuangyunsheng watoliu kou liudyboy tisonkun jackyyangpassion h20zhang mistobaan songqing pgwhalen xxxx001 ziy1-tan kkxiaotikk whitalk zhuyi1159 lidh15 yanwei-fabarta erdal-pb sighingsnow thisisobate jingchu000 hipoooop web-logs2 brunoscaglione

v6d's Issues

Implements signature for vineyard object

[BUG] Vineyard's builders and resolvers for Python types don't fit very well with Python itself (and numpy, pandas)

Describe your problem

The vineyard Python client crashes when getting an empty np.array from vineyard (#137).

In [1]: import vineyard

In [2]: cl = vineyard.connect("/tmp/vineyard.sock")

In [3]: cl.put(np.array(()))
Out[3]: o00019a42596292fc

In [4]: cl.get(cl.put(np.array(())))
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0108 12:57:40.512176 342670784 object_factory.cc:26] Failed to create an instance due to the unknown typename: vineyard::Tensor<float64>
[1]    14403 segmentation fault  ipython

variable-sized tuple (#137)
numpy ndarray with OBJECT type (hard to fix).
sparse arrays are stored in a contiguous, flattened manner
tensor order (easy to fix) (#137)
string type.
- fixed-sized string / fixed-sized binary type
- variable-sized string / python object

Additional context

N/A

Enhance the CI: check compatibility on various Linux/Mac systems, with various versions of dependencies.

Describe your problem

We need to ensure compatibility with the following settings:

OS:

Ubuntu 18.04
Ubuntu 20.04
macOS

Boost:

1.65 (apt-get installed on Ubuntu 18.04)
1.71 (apt-get installed on Ubuntu 20.04)
latest (brew installed on macOS)

apache-arrow:

1.0.1
2.0.0

Additional context

N/A

Scalability issues of pandas_dataframe_builder and pyarrow table_builder

I have a pyarrow table with columns composed of 21036 chunks (21032413 rows). Storing the table or the equivalent pandas dataframe in vineyard does not succeed in a reasonable amount of time (I waited a few minutes) but hangs in record_batch_builder (or the equivalent function for pandas).

That's unlike the pyarrow plasma implementation, which is implemented in C++ and just takes a few seconds:
https://github.com/apache/arrow/blob/995abdc02fed412bbd947fe41a0765036dbbe820/cpp/src/arrow/python/serialize.cc#L588-L599

Do you intend to match performance with pyarrow for large tables? Or is this out of scope for the project?

Return ErrorCode with namespace in macros

Describe your problem

I encountered an error on macOS when building GraphScope. The error messages are shown below:

/Users/yecol/git-repo/graphscope-edu/analytical_engine/core/fragment/arrow_projected_fragment.h:891:7: error: use of undeclared identifier
      'ErrorCode'; did you mean 'vineyard::ErrorCode'?
      ARROW_OK_OR_RAISE(begins_builder.Append(ret.first));
      ^
/usr/local/include/vineyard/graph/utils/error.h:220:23: note: expanded from macro 'ARROW_OK_OR_RAISE'
      RETURN_GS_ERROR(ErrorCode::kArrowError, (status_name).ToString()); \
                      ^
/usr/local/include/vineyard/graph/utils/error.h:31:12: note: 'vineyard::ErrorCode' declared here
enum class ErrorCode {
           ^
In file included from /Users/yecol/git-repo/graphscope-edu/analytical_engine/core/grape_instance.cc:20:
In file included from /Users/yecol/git-repo/graphscope-edu/analytical_engine/core/context/tensor_context.h:37:
/Users/yecol/git-repo/graphscope-edu/analytical_engine/core/fragment/arrow_projected_fragment.h:892:7: error: use of undeclared identifier
      'ErrorCode'; did you mean 'vineyard::ErrorCode'?
      ARROW_OK_OR_RAISE(ends_builder.Append(ret.second));
      ^
/usr/local/include/vineyard/graph/utils/error.h:220:23: note: expanded from macro 'ARROW_OK_OR_RAISE'
      RETURN_GS_ERROR(ErrorCode::kArrowError, (status_name).ToString()); \

I suggest that the macros refer to the ErrorCode enum class together with its namespace (vineyard::ErrorCode), since libvineyard may be used by other projects whose code lives outside namespace vineyard, such as GraphScope.

By the way, I also suggest an improvement to find_vineyard: print a hint like other libraries do:

NOT_Found libvineyard
# or 
Found libvineyard, (include: /usr/local/include, library: /usr/local/lib/libvineyard.dylib)

Description of the problem

If it is a bug report, to help us reproduce this bug, please provide the information below:

Your operating system version (uname -a): macOS
The version of libvineyard you use (vineyard.__version__): latest
Versions of crucial packages, such as mpi:
Full stack of the error: aforementioned
Minimized code to reproduce the error: cmake && make -j4 on macOS

If it is a feature request, please provide a clear and concise description of what you want to happen:

What is the problem:
The behaviour that you expect to work:

Additional context

Add any other context about the problem here.

Object serialization/deserialization: dump an object to external IO, and restore it back

Describe your problem

From alibaba/GraphScope#53

Additional context

N/A.

Eliminate arrow from libvineyard_client by redesigning the buffer in blob

Make the dependency clean to avoid potential conflicts from ABI problems of Arrow.

Enabling limited RPC data access functionality for streams.

Describe your problem

To handle cases like "opening a numpy ndarray as a stream".

I must admit that it is controversial and counterintuitive to add such a feature to vineyardd. I would like to know what others think about that.

Additional context

N/A.

Support Windows: MSVC on windows and WSL.

Describe your problem

WSL2 should be trivial to support since it is Linux running inside a virtual machine. But natively supporting Windows looks challenging:

we may need to make upstream dependencies (etcd-cpp-apiv3, etc.) support Windows first.
sharing a file descriptor over an IPC socket may require a totally different code path on Windows.

But it is worth a try.

Additional context

Such effort could be backported to apache-arrow as well.

Remove OSS CPP SDK related dependencies.

Both in submodules and CMakeLists, as it is no longer needed and its size is large.

Investigate the dataset-lifecycle-framework project

Describe your problem

The dataset-lifecycle-framework project abstracts data sources on S3 and NFS as a CRD, Dataset. By mapping the dataset into PersistentVolumeClaims and ConfigMaps, the dataset can be referred to in Pods.

The dataset-lifecycle-framework project also has an operator, and uses "labels" to indicate the dataset requirements in a Pod. The operator is responsible for mounting the dataset properly, just like we do in vineyard.

Vineyard uses a similar syntax in the "labels" of Pods to let the controller know which objects the pod will require. Beyond that, the vineyard operator considers data locality and works as a scheduler plugin.

Additional context

N/A

Enhancement: better support for memory hierarchy

Describe your problem

Vineyard addresses sharing immutable data in memory, but sometimes checkpointing or swapping data to storage devices does make sense.

Vineyard already supports object serialization/deserialization in #164, but that is not enough.

(WIP) more details of the design will be added later.

Additional context

N/A

vineyardd: error message is misleading when encountered permission problem

Describe your problem

vineyardd works with a default socket at /var/run/vineyard.sock, which usually requires "root" permission. However, when the file exists, vineyardd cannot report permission errors properly.

Vineyard's IPC client also suffers the same problem.

Revise global object based on signature.

Describe your problem

Global objects don't allow nesting.
A global object holds the signatures of its local objects, rather than their object IDs, and does signature -> id resolution when getting the object.
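
A minimal sketch of the signature -> id resolution described above, using plain dictionaries. The signatures, object ids, and the resolve_members helper are hypothetical, and the re-creation scenario in the comments is an assumption about why the indirection is useful, not a statement of the actual design.

```python
def resolve_members(member_signatures, signature_to_id):
    """Resolve a global object's member signatures to object ids at get time."""
    return [signature_to_id[sig] for sig in member_signatures]


# The global object stores signatures only.
global_object = ["sig-a", "sig-b"]
signature_to_id = {"sig-a": "o0001", "sig-b": "o0002"}
ids_before = resolve_members(global_object, signature_to_id)

# Assumption: if a member is re-created under a new id but keeps its
# signature, the global object itself does not need to change.
signature_to_id["sig-b"] = "o0003"
ids_after = resolve_members(global_object, signature_to_id)
```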

Additional context

This task depends on #113 .

Investigate the roadmap of integration with fluid project

Describe your problem

The fluid project manages a data cache layer (provided by alluxio or jindofs) for big data applications running on top of it. Fluid abstracts the data caching services as a Dataset, and mounts it to worker pods via a CSI driver. Besides, based on the affinity of volumes, fluid also considers data locality when scheduling.

Vineyard also considers data locality when scheduling. Currently vineyard abstracts objects as CRDs to make them visible to Kubernetes, and volumes also fit vineyard's design well. Integrating with fluid would benefit vineyard by making the scheduler plugin and vineyard-operator more lightweight.

In short, vineyard could work as a fluid runtime. The things that need to be done to work together with fluid are as follows (if I understand it correctly):

allow mounting a vineyard object as a volume. The vineyard server may need to provide a separate IPC service for that.
implement a controller for vineyard in fluid (a fluid runtime).

Additional context

N/A

Helm integration of vineyardd

How to deploy a DaemonSet using helm.

modify current gsa loader uri to fit network::uri

Current uri is bad for network::uri when delimiter=|

Migration of stream: a "pipe" driver that read stream from one vineyardd and create a new stream at another vineyardd

Enhancement: vineyard SDKs for other languages

Describe your problem

Beyond C++ and Python, many computation engines are written in other languages, e.g., Java/Scala, Rust, and Go. To support such languages well, we need to provide SDKs.

Java ( #227 )
Rust ( #273 )
Go ( #228 )

Additional context

The Java SDK has been requested in discussions by end users.

support series structure of pandas

cmake: required dependency on grapelite

https://github.com/alibaba/libvineyard/blob/053fb4119ac00490d7bf0ce7d2ec1c5c9a7ddd7e/modules/basic/CMakeLists.txt#L5

grapelite is listed as required dependency, even when building with -DBUILD_VINEYARD_GRAPH=0

Redesign and reimplement the registry mechanism for drivers

Describe your problem

Current situation doesn't work when the client is not in python context.

Additional context

TBF

Investigate the impact of Kubernetes removing Docker support

We need to test whether shared memory/IPC between pods is supported by other OCI container runtimes (e.g., containerd).

Distribute IO drivers registry (aka. the stream.py, ssh.sh, kube_ssh.sh) as an installable python package

Describe your problem

As described in the title.

Additional context

To make the IO stuff work for GraphScope.

Accurate memory allocation failure diagnostic

Vineyard's memory allocator is derived from plasma, and also suffers from issues in the plasma store server. When specifying a large --size value (exceeding the available size of /dev/shm), creating blobs won't fail, but any access (i.e., read/write) to the memory will trigger a SIGBUS, i.e., a bus error.

Such a signal is hard for users to catch, and usually leads to "crash" of the client program, which is quite bad.
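
One possible mitigation, sketched here as an assumption rather than as what vineyardd does, is a pre-flight check that compares the requested size with the free space of the shared-memory mount before allocating. The function name fits_in_shared_memory is hypothetical; shutil.disk_usage is from the Python standard library.

```python
import shutil


def fits_in_shared_memory(requested_bytes: int, shm_path: str = "/dev/shm") -> bool:
    """Return True if requested_bytes does not exceed the free space at shm_path."""
    free_bytes = shutil.disk_usage(shm_path).free
    return requested_bytes <= free_bytes
```

The check is only a snapshot: other processes may consume the space afterwards, so it turns the common misconfiguration into an early, readable error but cannot rule out a later bus error.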

Implicit/explicit object migration on demand

Enhancement: performance tuning for creating metadata, get, persist, and for concurrent connections

We haven't benchmarked vineyard or compared it with similar systems yet, but doing so is necessary.

Conditions that we need to test:

allocating blobs: test the memory allocator.
creating metadata: test the metadata service
persist many objects: test the communication with the etcd server.
getting metadata: test the performance of metadata lookup
getting blobs: test the performance implication of memory mapping
pre-reserving memory: test the performance of memory allocator

Performance related issues:

#179 about putting an arrow table with tons of chunks into vineyard.

How to launch a vineyard cluster?

Hi,
I'm puzzled about how a user can launch a vineyard cluster. Should I configure and start the etcd cluster first, all by myself?
Are there any steps or configurations I must pass to libvineyard?

Building errors arising from the thirdparty directory

Help,
I'm trying to build libvineyard from source.
The error message "there is no CMAKELists.txt in thirdparty/nlohmann" occurs while I'm using the cmake toolkit to build it.
I also notice that all the directories in thirdparty (like uri, pybind) seem to lack the CMAKELists.txt.
Any suggestions?

vineyardd missing from pip packages

Hi, I'm trying out vineyard as alternative to PyArrow Plasma, given that the latter is currently unmaintained.

https://v6d.io/notes/getting-started.html#starting-vineyard-server recommends running vineyardd, whereas https://v6d.io/notes/install.html#install-vineyard recommends installing vineyard from pypi.

Unfortunately https://pypi.org/project/vineyard/ does not contain the vineyardd binary. Is it intended?

Any documents or demo use of the communicator for bulk data exchange?

Hi,
As the documentation mentions in the architecture section, linked below,
https://v6d.io/notes/divein.html#architecture

the communicator seems to act like a data manager for bulk exchange between different vineyard nodes.

But I did not find any API or demo use of this class to exchange data between different vineyard instances.
It also seems that I can't find this class in the source release of libvineyard.

I'd appreciate your help.

The vineyardd hangs when starting if etcd has millions of keys

Description of the problem

The vineyardd hangs when launching:

$GLOG_v=100 ./bin/vineyardd --socket=/tmp/vineyard.sockxxx
I1102 12:16:08.303436 104502 vineyardd.cc:65] Hello vineyard!
I1102 12:16:08.309496 104502 meta_service.h:123] start!

After printing more detailed logs, we can see:

I1102 12:16:10.214583 104511 etcd_meta_service.cc:126] etcd ls use 978080 microseconds: 0
I1102 12:16:10.214630 104511 etcd_meta_service.cc:136] kvs.size() = 0
I1102 12:16:10.214675 104511 etcd_meta_service.cc:139] status = Etcd error: Received message larger than max (18368675 vs. 4194304), error code: 8
I1102 12:16:10.214730 104502 meta_service.h:501] request all: Etcd error: Received message larger than max (18368675 vs. 4194304), error code: 8

the error should be propagated
vineyardd should work with an etcd server that has many keys

Additional context

The hang-up problem itself could be fixed by etcd-cpp-apiv3/etcd-cpp-apiv3#21, but the

Bugs of `vineyard.connect` when interacting with pyarrow

We have observed that the following code doesn't work:

import pyarrow as pa
import vineyard

client = vineyard.connect('/var/run/vineyard.sock')

But if vineyard is imported first, it works as expected.

Stream: can only be consumed once, by one consumer.

stream one producer one consumer control on server side

Describe your problem

stream one producer one consumer control

Description of the problem

Currently, the one-producer-one-consumer control is on the user's side, which is weak.

Additional context

We would like to control this in vineyardd by marking the stream when it is opened for the first time.
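
The proposal can be sketched as a small server-side registry that marks a stream the first time it is opened in each mode and rejects later attempts. The class and exception names are hypothetical, and the "read"/"write" mode strings are an assumption made for illustration.

```python
class StreamOpenError(Exception):
    """Raised when a stream is opened a second time in the same mode."""


class StreamRegistry:
    """Server-side bookkeeping: one producer and one consumer per stream."""

    def __init__(self):
        self._opened = set()  # (stream_id, mode) pairs already claimed

    def open(self, stream_id, mode):
        if mode not in ("read", "write"):
            raise ValueError(f"unknown mode: {mode!r}")
        key = (stream_id, mode)
        if key in self._opened:
            raise StreamOpenError(f"stream {stream_id} already opened for {mode}")
        self._opened.add(key)  # mark the stream on first open
```

Because the marking happens on the server, a misbehaving client cannot bypass it the way it could with a purely client-side convention.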

Optimize metadata management in etcd, to reduce the txn size when persisting

Describe your problem

Etcd doesn't handle many keys in a single txn very well: the default limit on ops in a txn is 128, and etcdserver raises many warnings about "execute too long to execute" for both updates and range queries.

Meanwhile, flattening metadata in etcd doesn't bring many benefits.
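
One straightforward workaround, shown here only as a sketch and not as the fix adopted by the project, is to split a long list of operations into batches that each fit under the per-txn limit. The helper name split_into_txns is hypothetical; the default of 128 is the limit quoted above.

```python
def split_into_txns(ops, max_ops_per_txn=128):
    """Split a list of operations into batches no larger than the txn limit."""
    if max_ops_per_txn < 1:
        raise ValueError("max_ops_per_txn must be at least 1")
    return [ops[start:start + max_ops_per_txn]
            for start in range(0, len(ops), max_ops_per_txn)]
```

Note that batching gives up atomicity across batches, which is one reason to prefer reducing the number of keys per object in the first place.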

Additional context

None

File system interface: investigate and implement support for various cloud storage interfaces

Describe your problem

arrow filesystem: https://arrow.apache.org/docs/python/filesystems.html
filesystem-spec: https://filesystem-spec.readthedocs.io/en/latest/#
pyfilesystem: https://www.pyfilesystem.org/

Additional context

We may need to add ossfs support in fsspec, like s3fs.

boost-leaf's link is dead

Real Link is https://github.com/boostorg/leaf

Sync vineyard object's metadata with kubernetes as CRDs

Global objects are CRDs
Local objects that are referred to by a global object are CRDs
Local object CRD:
- key: signature
- location: vineyardd that has such instances

Clarify the meaning of "distributed" in vineyard in documentation: we don't replicate data, rather, we do sharding

Describe your problem

We have come across the question "does vineyard replicate data across nodes?" several times, from end users and from the CNCF community. I think we need a section in the documentation to clarify our design point.

Maybe we also need to share such information in the README.
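
To illustrate the distinction, the sketch below places each chunk on exactly one node, so no chunk is stored twice. The round-robin policy and the shard_chunks name are assumptions chosen for brevity and say nothing about how vineyard actually decides placement.

```python
def shard_chunks(chunks, nodes):
    """Assign each chunk to exactly one node (round-robin); nothing is replicated."""
    placement = {node: [] for node in nodes}
    for index, chunk in enumerate(chunks):
        placement[nodes[index % len(nodes)]].append(chunk)
    return placement


placement = shard_chunks(["c0", "c1", "c2", "c3", "c4"], ["node-0", "node-1"])
```

With replication, the same chunk would appear under several nodes; with sharding, the per-node lists are disjoint and together cover the whole object.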

Additional context

N/A

Version validation between client and server

Describe your problem

The compatibility between client and server should follow semver, and we need to validate that when establishing the connection.
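
A minimal sketch of such a check follows. The rule used here (same major version, and additionally the same minor version while the major version is 0) is an assumption based on a common reading of semver, not the project's actual policy, and the function names are hypothetical.

```python
def parse_version(version: str):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(client_version: str, server_version: str) -> bool:
    """Return True if client and server versions may talk to each other."""
    client = parse_version(client_version)
    server = parse_version(server_version)
    if client[0] != server[0]:
        return False
    if client[0] == 0:
        # Assumption: 0.y.z releases may break between minor versions.
        return client[1] == server[1]
    return True
```

A client could send its version in the first message of the handshake and let the server reject the connection when is_compatible returns False.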

Additional context

None.

Replace boost/property_tree with other json library.

Describe your problem

Other JSON libraries, like nlohmann/json and rapidjson, have better performance, a smaller memory footprint, and, most importantly, better JSON conformance than boost/property_tree.

support the index structure of pandas

Describe your problem

Compared to the pandas dataframe, the vineyard dataframe does not support an index yet.
Store the index as a member of the vineyard object.

initContainer: a tool to help data migration on kubernetes

pandas 1.2.0 incompatibility issue

Describe your problem

pandas 1.2.0 includes some incompatible changes that break our builder and resolver for DataFrame.

Additional context

Also affects mars: mars-project/mars#1845

Vineyard-Operator Integration with kubernetes

Describe your problem

To make vineyard work with kubernetes, we need:

n.b.: some of the work has been merged to the dev/kubernetes branch.

basic support for deploying vineyard as DaemonSet
Python API like KubeCluster
A kubernetes scheduler plugin prototype, to demonstrate data-aware scheduling
Distribute the vineyard operator to artifacthub to help others operate vineyard clusters more easily.

Reduce the coupling with meta in `read_bytes.py`.

Implement "abort" semantics of build/seal in vineyard client

Describe your problem

When an error occurs after sealing part of an object into vineyard, we should let the user recover from the error and do proper cleanup in an easy fashion.

Additional context

Or at least, we should provide tools/utilities for users to achieve such goals.
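
One way such a utility could look is a context manager that remembers what has been sealed and deletes it if the build fails partway. Everything below is hypothetical: InMemoryStore is a stand-in for an object store, not vineyard's client, and build_session is an illustrative name.

```python
from contextlib import contextmanager


class InMemoryStore:
    """Hypothetical stand-in for an object store; not vineyard's client."""

    def __init__(self):
        self.objects = {}
        self._next = 0

    def seal(self, value):
        object_id = f"o{self._next:04d}"
        self._next += 1
        self.objects[object_id] = value
        return object_id

    def delete(self, object_id):
        self.objects.pop(object_id, None)


@contextmanager
def build_session(store):
    """Track sealed ids and delete them if the build fails partway."""
    sealed = []

    def seal(value):
        object_id = store.seal(value)
        sealed.append(object_id)
        return object_id

    try:
        yield seal
    except Exception:
        # Abort: remove everything sealed in this session, newest first.
        for object_id in reversed(sealed):
            store.delete(object_id)
        raise
```

Usage would be `with build_session(store) as seal: ...`; on success the sealed objects stay, and on any exception they are cleaned up before the error propagates.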

Expose the exception messages from drivers to client.

Currently, if a driver throws an exception, the error message cannot be seen from the client, which is inconvenient for debugging.
Some work needs to be done to expose the errors to the user.

Revise vineyard wheel packages for macOS

Describe your problem

We have received a bug report that an ImportError occurs when importing vineyard on macOS:

ImportError: dlopen(/usr/local/lib/python3.7/site-packages/vineyard-0.1.3-py3.7-macosx-10.14-x86_64.egg/vineyard/_C.cpython-37m-darwin.so, 2): Symbol not found: _aligned_alloc

Additional context

That may be related to Homebrew/homebrew-core#46393 and Homebrew/homebrew-core#45585.
