Giter Club home page Giter Club logo

v6d-io / v6d Goto Github PK

View Code? Open in Web Editor NEW
804.0 28.0 116.0 19.13 MB

vineyard (v6d): an in-memory immutable data manager. (Project under CNCF, TAG-Storage)

Home Page: https://v6d.io

License: Apache License 2.0

CMake 2.53% C++ 50.01% Python 17.52% Shell 0.54% C 0.09% Smarty 0.06% Makefile 0.58% Dockerfile 0.10% Go 13.01% Rust 4.46% Java 10.45% Cuda 0.05% Scala 0.55% HiveQL 0.06%
distributed-comp distributed-systems in-memory-storage big-data-analytics graph-analytics shared-memory distributed cloud-native cncf sig-storage

v6d's People

Contributors

acezen avatar andydiwenzhu avatar chaitravi-ce avatar check-spelling-bot avatar chenrui333 avatar dashanji avatar dependabot[bot] avatar husimplicity avatar lidongze0629 avatar linlih avatar liusitan avatar luoxiaojian avatar peilii avatar pwrliang avatar rohan-cod avatar septicmk avatar shihao-thx avatar sighingnow avatar sighingsnow avatar sijie-l avatar siyuan0322 avatar songqing avatar thisisobate avatar univerone avatar vegetableysm avatar waruto210 avatar wuyueandrew avatar zhanglei1949 avatar zhuyi1159 avatar zjuytw avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

v6d's Issues

Buiding errors arising from the thridparty directory

Help,
I'm trying to build the libvineyard from source,
the error message "there is no CMAKELists.txt in thirdparty/nlohmann" occurs while i'm using the cmake tool-kit to build it,
And i also notice that all the direcotries in the thirdparty(like uri, pybind) seem to lack the CMAKELists.txt
Any suggestion for it?

Accurate memory allocation failure diagnostic

Vineyard's memory allocator derived from plasma, and also suffers from issues in plasma store server. When specifying a large --size value (exceeds the available size of /dev/shm), creating blobs won't fail, but any touching (aka. read/write) on the memory will trigger a "SIGBUG", i.e., bus error.

Such a signal is hard for users to catch, and usually leads to "crash" of the client program, which is quite bad.

support the index structure of pandas

Describe your problem

compare to dataframe of pandas, the dataframe of vineyard not support index now.
store index as a member of vineyard object

Investigate the dataset-lifecycle-framework project

Describe your problem

The dataset-lifecycle-framework project abstracts data sources on S3 and NFS as CRD Dataset. By mapping the dataset into PersistentVolumeClaims and ConfigMaps, the dataset could be refered in Pods.

The dataset-lifecycle-framework project also has an operator, and use "labels" to indicate the dataset requirements in Pod. The operator is responsible for mounting the dataset properly. Just like we do in vineyard.

Vineyard use a similar syntax in "labels" of Pods to let the controller know which object the pod will require. Beyond that, vineyard operator considers the data locality and works as a scheduler plugin.

Additional context

N/A

Support Windows: MSVC on windows and WSL.

Describe your problem

The WSL2 should be trivial to support since it is a Linux running inside the virutal machine. But natively supporting Windows looks chanllanging:

  • we may need to make upstream dependecies (etcd-cpp-apiv3, etc.) to support windows first.
  • sharing a file descriptor over IPC socket may require a totally different codepath on Windows.

But this does worth a try.

Additional context

Such effort could be backported to apache-arrow as well.

Revise global object based on signature.

Describe your problem

  • Global object doesn't allow nestification.
  • Global object holds the signature of local object, rather than object ID, and do signature -> id resolution when getting object.

Additional context

This task depends on #113 .

stream one producer one consumer control on server side

Describe your problem

stream one producer one consumer control

Descirption of the problem

Current one producer one consumer control is on the user's side, which is weak

Additional context

We would like to control this in vineyardd by marking the stream when it is open the first time

Return ErrorCode with namespace in macros

Describe your problem

  • I occurred an error on macOS when building graphscope. error messages are shown as below:
/Users/yecol/git-repo/graphscope-edu/analytical_engine/core/fragment/arrow_projected_fragment.h:891:7: error: use of undeclared identifier
      'ErrorCode'; did you mean 'vineyard::ErrorCode'?
      ARROW_OK_OR_RAISE(begins_builder.Append(ret.first));
      ^
/usr/local/include/vineyard/graph/utils/error.h:220:23: note: expanded from macro 'ARROW_OK_OR_RAISE'
      RETURN_GS_ERROR(ErrorCode::kArrowError, (status_name).ToString()); \
                      ^
/usr/local/include/vineyard/graph/utils/error.h:31:12: note: 'vineyard::ErrorCode' declared here
enum class ErrorCode {
           ^
In file included from /Users/yecol/git-repo/graphscope-edu/analytical_engine/core/grape_instance.cc:20:
In file included from /Users/yecol/git-repo/graphscope-edu/analytical_engine/core/context/tensor_context.h:37:
/Users/yecol/git-repo/graphscope-edu/analytical_engine/core/fragment/arrow_projected_fragment.h:892:7: error: use of undeclared identifier
      'ErrorCode'; did you mean 'vineyard::ErrorCode'?
      ARROW_OK_OR_RAISE(ends_builder.Append(ret.second));
      ^
/usr/local/include/vineyard/graph/utils/error.h:220:23: note: expanded from macro 'ARROW_OK_OR_RAISE'
      RETURN_GS_ERROR(ErrorCode::kArrowError, (status_name).ToString()); \

I suggest we return the errorcode enum class together with namespace.
Since libvineyard may be used by another project which is on the outside of namespace vineyard. Just like graphscope.

  • btw, I also suggest an inprovement on find_vineyard, give a hint like other libraries:
NOT_Found libvineyard
# or 
Found libvineyard, (include: /usr/local/include, library: /usr/local/lib/libvineyard.dylib)

Descirption of the problem

If it is a bug report, to help us reproducing this bug, please provide information below:

  1. Your Operation System version (uname -a): macOS
  2. The version of libvineyard you use (vineyard.__version__): latest
  3. Versions of crucial packages, such as mpi:
  4. Full stack of the error: aforementioned
  5. Minimized code to reproduce the error: cmake && make -j4 on macOS

If it is a feature request, please provides a clear and concise description of what you want to happen:

  1. What is the problem:
  2. The behaviour that you expect to work:

Additional context

Add any other context about the problem here.

Investigate the roadmap of integration with fluid project

Describe your problem

The fluid project manages a data cache layer (provided by alluxio or jindofs) for on-top bigdata applications. Fluid abstract the data caching services as a Dataset, and mount it to worker pods via a CSI driver. Besides, based the affinity of volumes, fluid also considering the data locality when scheduling.

Vineyard also consider the data locality when scheduling. Currently vineyard abstracts objects as CRDs to make them visible for kubernetes. And volumes also fit vineyard's design well. Integrating with fluid would benifits vineyard to make the sheduler plugin and vineyard-operator more lightweight.

To be abbreviated, vineyard could work as a fluid runtime. Things need to be done to work together with fluid are as follows (if
I understand it correctly):

  • allowing mount an vineyard object as a volume. Vineyard server may need to provided a seperated IPC service for that.
  • implements a controller for vineyard in fluid (a fluid runtime).

Additional context

N/A

The vineyardd hangs when starting if etcd has millions of keys

Descirption of the problem

The vineyardd hangs when launching:

$GLOG_v=100 ./bin/vineyardd --socket=/tmp/vineyard.sockxxx
I1102 12:16:08.303436 104502 vineyardd.cc:65] Hello vineyard!
I1102 12:16:08.309496 104502 meta_service.h:123] start!

After print more detail logs, we could see:

I1102 12:16:10.214583 104511 etcd_meta_service.cc:126] etcd ls use 978080 microseconds: 0
I1102 12:16:10.214630 104511 etcd_meta_service.cc:136] kvs.size() = 0
I1102 12:16:10.214675 104511 etcd_meta_service.cc:139] status = Etcd error: Received message larger than max (18368675 vs. 4194304), error code: 8
I1102 12:16:10.214730 104502 meta_service.h:501] request all: Etcd error: Received message larger than max (18368675 vs. 4194304), error code: 8
  1. the error should be propogated
  2. vineyardd should work with etcd server that have many keys

Additional context

The hang-up problem itself could be fixed by etcd-cpp-apiv3/etcd-cpp-apiv3#21, but the

Optimize metadata management in etcd, to reduce the txn size when persisting

Describe your problem

Etcd doesn't do very well for many keys in a single txn, the default limitation of ops in a txn is 128, and etcdserver raises many warnings about "execute too long to execute" both in update and range query.

Meanwhile flatten metadatas in etcd doesn't bring many benefits.

Additional context

None

Vineyard-Operator Integration with kubernetes

Describe your problem

To make vineyard works with kubernetes, we need:

n.b.: some of the work has been merged to the dev/kubernetes branch.

  • basic support for deploying vineyard as DaemonSet
  • Python API like KubeCluster
  • A kubernetes scheduler plugin prototype, to demostrate data-aware scheduling
  • Distribute the vineyard operator to artifacthub to help others operate vineyard cluster easier.

Enabling limited RPC data access functionality for streams.

Describe your problem

To handle cases like "opening a numpy ndarray as a stream".

I must admit that it is controversal and counterintuitive to add such feature to vineyardd. I would like to know how others think about that.

Additional context

N/A.

[BUG] Vineyard's builders and resolvers for python types doesn't fit very well with python itself (and numpy, pandas)

Describe your problem

  • Vineyard python client crashes when get empty np.array from vineyard (#137).

    In [1]: import vineyard
    
    In [2]: cl = vineyard.connect("/tmp/vineyard.sock")
    
    In [3]: cl.put(np.array(()))
    Out[3]: o00019a42596292fc
    
    In [4]: cl.get(cl.put(np.array(())))
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I0108 12:57:40.512176 342670784 object_factory.cc:26] Failed to create an instance due to the unknown typename: vineyard::Tensor<float64>
    [1]    14403 segmentation fault  ipython
  • vary-sized tuple (#137)

  • numpy ndarray with OBJECT type (hard to fixes).

  • sparse arrays are stored in a continous flatten manner

  • tensor order (easy to fixes) (#137)

  • string type.

    • fixed-sized string / fixed-sized binary type
    • vary-sized string / python object

Additional context

N/A

Implements "abort" semantic of build/seal in vineyard client

Describe your problem

When error occurs after seal part of the object into vineyard, we should support the user to recover from error and do proper cleanup in a easy fashion.

Additional context

Or at least, we should provides tools/utilities for users to archieve such goals.

Enhancement: better support for memory hierarchy

Describe your problem

Vineyard addresses sharing immutable data in-memory, but sometimes checkpoint or swap data to storage devices do make sense.

Vineyard already support object serailization/deserailzation in #164, but that is not enough.

(WIP) more details of the design will be added later.

Additional context

N/A

How to launch a vineyard cluster ?

Hi,
I'm puzzled about how the user can launch a vineyard cluster , Should i configure and start the etcd cluster firstly all
by myself ?
Is there any step or configurations i must pass to the libvineyard?

Scalability issues of pandas_dataframe_builder and pyarrow table_builder

I have a pyarrow table with columns composed of 21036 chunks (21032413 rows). Storing the table or the equivalent pandas dataframe in vineyard does not succeed in reasonable amount of time (I waited a few minutes) but hangs in record_batch_builder (and the equivalent function for pandas respectively).

That's unlike the pyarrow plasma implementation, which is implemented in C++ and just takes a few seconds:
https://github.com/apache/arrow/blob/995abdc02fed412bbd947fe41a0765036dbbe820/cpp/src/arrow/python/serialize.cc#L588-L599

Do you intend to match performance with pyarrow for large tables? Or is this out of scope for the project?

Any documents or demo use of the communicator for bulk data exchange?

Hi,
As the document mentioned in the architecture , which is linked below,
https://v6d.io/notes/divein.html#architecture

the communicator seems to act like a data manager for the bulk exchange between different vineyard nodes

But i did not find any API or demo use of this class to exchange data
between different vineyard instance.
And it seems that i can't find this class in the source release of libvineyard

wish for your aid

Enhancement: performance tuning for creating metadata, get, persist, and for concurrent connections

We haven't do benchmark and compare with similar systems, but that is actually necessary.

Conditions that we need to test:

  • allocating blobs: test the memory allocator.
  • creating metadata: test the metadata service
  • persist many objects: test the communication between etcd server.
  • getting metadata: test the performance of metadata lookup
  • getting blobs: test the performance implication of memory mapping
  • per-reserving memory: test the performance of memory allocator

Performance related issues:

  • #179 about putting an arrow table with tons of chunks into vineyard.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.