consortium-feedback's Introduction

Discussions & feedback

This repository is meant to be the central place where topics that are broader than a single repository can be discussed publicly. Everyone interested in the Consortium for Python Data API Standards is invited to ask questions, air concerns and suggest topics to cover or actions to take.

For topics that would best fit a forum or mailing list style discussion, please use the GitHub Discussions tab on this repository.

consortium-feedback's Issues

[RFC] Adopt DLPack as cross-language C ABI stable data structure for array exchange

In order for an ndarray system to interact with a variety of frameworks, a stable in-memory data structure is needed.

DLPack is one such data structure that allows exchange between major frameworks. It was developed with input from many core developers of deep learning systems. Highlights include:

  • Minimal and stable: a simple header
    • The spec has stayed roughly unchanged for more than four years.
  • Designed for cross-hardware use: CPU, CUDA, OpenCL, Vulkan, ROCm, Hexagon
  • Already a "standard" with wide community adoption and support; adopters I am aware of include:
    • Frameworks: tensorflow/jax, pytorch, mxnet
    • Libraries: dgl, spaCy etc.
    • Compilers: TVM
  • Clean and C ABI compatible
    • This means it can be created and accessed from any language
    • It is also essential for building JIT and AOT compilers that support these data types.
  • Designed with high performance in mind
    • The data field is required to align to 256 bytes (for aligned loads); byte_offset can be used to offset into the buffer if necessary

The main design rationale of DLPack is minimalism. DLPack drops considerations of allocators and device APIs and focuses on the minimal data structure, while still accounting for cross-hardware support (e.g. the data field is opaque on platforms that do not support normal addressing).

It also simplifies parts of the design to avoid legacy issues (e.g. everything is assumed to be row-major, with strides available to support other cases, avoiding the complexity of considering more layouts).

After building frameworks around the related data structures for a while, and seeing an ecosystem grow around them, I am quite convinced that DLPack is an important candidate, if not the best one, for the C ABI array data structure.

Given that array exchange is one goal of the consortium, it would be great to explore whether DLPack can be used as the stable C ABI structure for array exchange.
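
As a hedged illustration of what this enables at the Python level, here is a minimal sketch of zero-copy exchange built on DLPack. It assumes NumPy's from_dlpack (available in NumPy >= 1.22) and PyTorch as the producer; any DLPack-aware framework could play either role:

import numpy as np
import torch  # any DLPack-aware producer works here

t = torch.arange(6, dtype=torch.float32).reshape(2, 3)
a = np.from_dlpack(t)      # consumes the DLPack capsule; no copy on CPU
a[0, 0] = 42.0             # both objects view the same memory
assert t[0, 0].item() == 42.0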

If the proposal receives a positive response from the community, we would be more than happy to explore options for building a neutral, open governance model (e.g. like the Apache model) that continues to oversee the evolution of the spec -- for example, donating DLPack to the data-api consortium or hosting it in a clean GitHub org.

Empty Array versus Scalar Array Shape Convention

I was unable to determine the difference between shapes () and (0,) for arrays in the standard. For example, nothing is mentioned in the reshape method, the shape attribute, or the array object references.

My confusion arises when running the test_full unit test, although I'm sure the same is true for other tests. Essentially, both the () and (0,) shapes are expected to return a non-empty array at the ph.assert_fill line.

Does the standard support empty arrays? If so, where can I find a description of the shape convention for empty arrays versus scalar (shapeless) arrays? Any help appreciated!
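
For reference, here is how the two shapes differ in NumPy, taken only as one illustrative implementation rather than a statement of what the standard mandates:

import numpy as np

scalar_arr = np.asarray(3.0)  # shape (): zero-dimensional, exactly one element
empty_arr = np.zeros((0,))    # shape (0,): one-dimensional, zero elements

assert scalar_arr.ndim == 0 and scalar_arr.size == 1
assert empty_arr.ndim == 1 and empty_arr.size == 0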

[RFC] Adopt einops as an API for shape-changing operations

Changing several core elements of array manipulation allows writing clean, reliable code AND resolves many conflicts.

Einops is a minimalist notation for tensor manipulations that replaces the following operations:

  • view / reshape
  • reductions: max, min, mean, etc.
  • expand_dims, squeeze and unsqueeze
  • transpose/permute/rollaxis
  • repeat
  • tile
  • stack (and, in limited scenarios, concatenate/cat)
  • array_split/split/chunks
  • flatten

For example, the transposition

x.transpose(0, 2, 3, 1)

in einops is written as

rearrange(x, 'b c h w -> b h w c')  # long names are also allowed

There are only three operations; please read the tutorial. A short sketch follows.
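
As a quick sketch of the three operations (the shapes and patterns below are chosen arbitrarily for illustration):

from einops import rearrange, reduce, repeat
import numpy as np

x = np.random.rand(2, 3, 32, 32)                # axes: b c h w

y = rearrange(x, 'b c h w -> b h w c')          # transpose-style reordering
m = reduce(x, 'b c h w -> b c', 'mean')         # global average pooling
r = repeat(x, 'b c h w -> b c (h up) w', up=2)  # repeat along height

assert y.shape == (2, 32, 32, 3)
assert m.shape == (2, 3)
assert r.shape == (2, 3, 64, 32)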

Motivation

  • einops is already implemented and works across a number of frameworks

    • it works identically in all supported frameworks
    • the existing codebase is pure Python and has zero dependencies, so it can easily be moved inside frameworks and adjusted if needed
  • einops solves the problem of interface conflicts that data-apis tries to solve (only for some operations, but the covered subset of operations is probably the main source of discrepancies)

    • when moving code from existing frameworks to a "unified API", mistakes are hard to avoid: torch.repeat does the job of numpy.tile, not numpy.repeat; flatten is defined differently across frameworks; sets of axes are passed as tuples or as individual arguments; and there are other discrepancies
    • the similarity of the previous and new APIs would almost surely lead to "migration by substitution"
    • avoiding conflicts with already existing operations is part of einops' design
  • a verbose API has multiple advantages (demonstrated here) and helps other people follow code, which is non-trivial for tensor packages

  • view / reshape are dangerous operations: they can completely break tensor structure, and mistakes made with them are almost impossible to catch (when dealing with just a bunch of numbers in an intermediate activation, there is no way to tell that they are ordered the wrong way).

    • einops replaces these operations, adds checks, and makes it impossible to break tensor structure (see the sketch after this list)
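
A small sketch of the checks mentioned above: the pattern encodes the expected structure, so a mismatch raises an error instead of silently scrambling data (the shapes here are arbitrary):

from einops import rearrange
import numpy as np

x = np.zeros((2, 3, 4))  # only three axes

# A plain reshape would happily produce *something* here;
# einops refuses because the pattern expects four axes.
try:
    rearrange(x, 'b c h w -> b h w c')
except Exception as err:
    print(err)  # an EinopsError explaining the mismatch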

While this looks like a big leap, it is a surprisingly good moment to make it: a move to more reliable and transferable code without deprecating parts of the existing API. In this scenario, the first transition would be made by library developers, which would simplify subsequent wider adoption.

In case of a positive decision, the introduction of new features and operations in einops would first be coordinated with/by the consortium to ensure identical and efficient support by all frameworks.

Protocol for distributed data

In order for an ndarray/dataframe system to interact with a variety of frameworks in a distributed environment (such as clusters of workstations) a stable description of the distribution characteristics is needed.

The __partitioned__ protocol accomplishes this by defining a structure which provides the necessary meta-information. Implementations show that it allows data exchange between different distributed frameworks without unnecessary data transmission or copies.

The structure defines how the data is partitioned and where it is located, and it provides a function to access the local data. It does not define or provide any means of communication, messaging and/or resource management. It merely describes the current distribution state.
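
To make this concrete, below is a hedged, self-contained sketch of what such a structure might look like for a 1-D array split into two partitions. The key names follow the draft proposal as I understand it and should be checked against the actual spec; the futures stand in for framework-specific handles that would normally live on remote nodes:

from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Stand-ins for distributed partitions: in a real system these would be
# handles/futures for data living on different processes or nodes.
pool = ThreadPoolExecutor()
handle0 = pool.submit(np.arange, 0, 4)
handle1 = pool.submit(np.arange, 4, 8)

partitioned = {
    'shape': (8,),             # global shape of the array
    'partition_tiling': (2,),  # partition grid: 2 partitions along axis 0
    'partitions': {
        (0,): {'start': (0,), 'shape': (4,), 'data': handle0, 'location': ['node0']},
        (1,): {'start': (4,), 'shape': (4,), 'data': handle1, 'location': ['node1']},
    },
    'get': lambda handle: handle.result(),  # resolve a handle to local data
}

# A consumer can reassemble (or, better, process in place) without knowing
# which framework produced the data.
chunks = [partitioned['get'](p['data']) for p in partitioned['partitions'].values()]
assert np.array_equal(np.concatenate(chunks), np.arange(8))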

In a way, this is similar to the metadata which dlpack provides for exchanging data within a single node (including the local GPU) - but it does so for data which lives on more than one process/node. It complements mechanisms for intra-node exchange, such as dlpack, __array_interface__ and the like.

The current lack of such a structure typically leads to one of the following scenarios when connecting different frameworks:

  • a data consumer implements dedicated import functionality for every distributed data container it considers important enough. As an example, xgboost_ray implements a variety of data_sources.
    This PR lets xgboost_ray automatically work with any new container that supports __partitioned__ (the extra code for modin DF and Ray/MLDataSet is no longer needed once they support it, too)
  • the user needs to explicitly deal with the distributed nature of the data. This either leads to unnecessary data transfer, or developers need to understand the internals of the data container. In the latter case they are exposed to explicit parallelism/distribution, while often the original intent of the producer was to hide exactly that.
    The implementation here avoids that by entirely hiding the distribution features while still allowing zero-copy data exchange between __partitioned__ containers (exemplified by modin (PR) and HeAT (PR)) and scikit-learn-intelex/daal4py.
  • frameworks like dask and ray wrap established APIs (such as pytorch, xgboost etc.) and ask developers to switch to the new framework/API and to adopt their programming model.

A structure like __partitioned__ enables distributed data exchange between various frameworks, which can be seamless to users while avoiding unnecessary data transfer.

The __partitioned__ protocol is an initial proposal; any feedback from this consortium will be highly appreciated. We would like to build a neutral, open governance model that continues to oversee the evolution of the spec. For example, if some version of it receives a positive response from the community, we would be more than happy to donate the __partitioned__ protocol to the data-api consortium or host it in a clean GitHub org.

Proposal: a high-level test suite to check equivalence of array modules

Hi all - @rsokl and I saw the launch announcement a few days ago, and as frequent array users we're very excited.  

That's not why I'm here though: we want to propose that the Consortium provide a high-level test suite for the interoperable or standardised parts of an array API.

We're core developers of the Hypothesis library for property-based testing (or structured/semantic fuzzing), and have done a lot of work on Numpy support and found a bunch of bugs (with fixes being the limiting factor).  I wrote up some advice as a paper for SciPy 2020.  Key points for the Consortium:

  • checking independent implementations for equivalent behaviour ("differential testing") has found very many bugs in e.g. compilers, and doesn't require a formal spec.  This pattern also makes running the test suite on new modules very easy - see the sketch after this list!
  • randomised testing of even simple properties ("does not crash", "save/load is a no-op", etc) also finds very many bugs - in my experience most test functions I write for Numpy find novel bugs.
  • Hypothesis ships with a rich suite of functions to describe input data, including for Numpy and Pandas, and is already used by {numpy,astropy,pandas,xarray,...}.  This extends to e.g. sets of mutually-broadcastable shapes suitable for a given gufunc signature.
  • We even have a "ghostwriter" to help get started - try e.g. hypothesis write numpy.matmul or hypothesis write --equivalent numpy.add mxnet.add pytorch.add
  • As a proof of concept, @rsokl's autograd library mygrad relies heavily on Hypothesis to check both equivalence with Numpy and correctness of gradients; he credits property-based testing with making the project possible.  Many of mygrad's tests could easily be adapted to form the core of a standard array module test suite - for example, here are all two of the tests for matrix multiplication, testing forward and backprop through any operation on any arrays.
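
To illustrate, here is a hedged sketch of a differential property test in this style. It uses NumPy on both sides so that it is self-contained; in a real cross-module suite the second module would be e.g. cupy or torch:

import numpy as np
from hypothesis import given
from hypothesis.extra import numpy as hnp

# Differential testing: generate arbitrary arrays and assert that two
# implementations of the same operation agree.
module_a, module_b = np, np  # swap module_b for another array library

@given(hnp.arrays(dtype=np.float64, shape=hnp.array_shapes(max_dims=3)))
def test_sum_is_equivalent(x):
    np.testing.assert_array_equal(module_a.sum(x), module_b.sum(x))

test_sum_is_equivalent()  # Hypothesis calls this with many generated inputs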

We're both keen to see the Consortium succeed, and happy to help out with testing if that would be welcome.
