consortium-feedback's Introduction

Discussions & feedback

This repository is meant to be the central place where topics that are broader than a single repository can be discussed publicly. Everyone interested in the Consortium for Python Data API Standards is invited to ask questions, air concerns and suggest topics to cover or actions to take.

For topics that would best fit a forum or mailing list style discussion, please use the GitHub Discussions tab on this repository.

consortium-feedback's Issues

[RFC] Adopt DLPack as cross-language C ABI stable data structure for array exchange

In order for an ndarray system to interact with a variety of frameworks, a stable in-memory data structure is needed.

DLPack is one such data structure that allows exchange between major frameworks. It was developed with input from many core developers of deep learning systems. Highlights include:

  • Minimal and stable: a simple header
    • The spec has stayed roughly unchanged for more than four years.
  • Designed for cross-hardware use: CPU, CUDA, OpenCL, Vulkan, ROCm, Hexagon
  • Already a "standard" with wide community adoption and support; adopters I am aware of include:
    • Frameworks: tensorflow/jax, pytorch, mxnet
    • Libraries: dgl, spaCy etc.
    • Compilers: TVM
  • Clean and C ABI compatible
    • This means it can be created and accessed from any language
    • It is also essential for building JIT and AOT compilers that support these data types.
  • Designed with high performance in mind
    • The data field is required to align to 256 bytes (for aligned loads); byte_offset can be used to offset into the buffer if necessary

The main design rationale of DLPack is minimalism. DLPack drops considerations of allocators and device APIs and focuses on the minimal data structure, while still accounting for cross-hardware support (e.g. the data field is opaque on platforms that do not support normal addressing).

It also simplifies parts of the design to avoid legacy issues (e.g. everything is assumed to be row-major, with strides available to support other cases, avoiding the complexity of considering more layouts).

After building frameworks around the related data structures for a while, and seeing an ecosystem grow around them, I am quite convinced that DLPack is an important candidate, if not the best one, for the C ABI array data structure.

Given that array exchange is one goal of the consortium, it would be great to explore whether DLPack can be used as the stable C ABI structure for array exchange.
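
As a hedged illustration of what this enables at the Python level, here is a minimal sketch of zero-copy exchange built on DLPack. It assumes NumPy's from_dlpack (available in NumPy >= 1.22) and PyTorch as the producer; any DLPack-aware framework could play either role:

import numpy as np
import torch  # any DLPack-aware producer works here

t = torch.arange(6, dtype=torch.float32).reshape(2, 3)
a = np.from_dlpack(t)      # consumes the DLPack capsule; no copy on CPU
a[0, 0] = 42.0             # both objects view the same memory
assert t[0, 0].item() == 42.0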

If the proposal receives a positive response from the community, we would be more than happy to explore options for building a neutral, open governance model (e.g. like the Apache model) that continues to oversee the evolution of the spec -- for example, donating DLPack to the data-api consortium or hosting it in a clean GitHub org.

Empty Array versus Scalar Array Shape Convention

I was unable to determine the difference between shapes () and (0,) for arrays in the standard. For example, nothing is mentioned in the reshape method, the shape attribute, or the array object references.

My confusion arises when running the test_full unit test, although I'm sure the same is true for other tests. Essentially, both the () and (0,) shapes are expected to return a non-empty array at the ph.assert_fill line.

Does the standard support empty arrays? If so, where can I find a description of the shape convention for empty arrays versus scalar (shapeless) arrays? Any help appreciated!
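
For reference, here is how the two shapes differ in NumPy, taken only as one illustrative implementation rather than a statement of what the standard mandates:

import numpy as np

scalar_arr = np.asarray(3.0)  # shape (): zero-dimensional, exactly one element
empty_arr = np.zeros((0,))    # shape (0,): one-dimensional, zero elements

assert scalar_arr.ndim == 0 and scalar_arr.size == 1
assert empty_arr.ndim == 1 and empty_arr.size == 0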

[RFC] Adopt einops as an API for shape-changing operations

Changing several core elements of array manipulation allows writing clean, reliable code AND resolves many conflicts.

Einops is a minimalist notation for tensor manipulations that replaces the following operations:

  • view / reshape
  • reductions: max, min, mean, etc.
  • expand_dims, squeeze and unsqueeze
  • transpose/permute/rollaxis
  • repeat
  • tile
  • stack (and, in limited scenarios, concatenate/cat)
  • array_split/split/chunks
  • flatten

For example, the transposition

x.transpose(0, 2, 3, 1)

in einops is written as

rearrange(x, 'b c h w -> b h w c')  # long names are also allowed

There are only three operations; please read the tutorial. A short sketch follows.
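
As a quick sketch of the three operations (the shapes and patterns below are chosen arbitrarily for illustration):

from einops import rearrange, reduce, repeat
import numpy as np

x = np.random.rand(2, 3, 32, 32)                # axes: b c h w

y = rearrange(x, 'b c h w -> b h w c')          # transpose-style reordering
m = reduce(x, 'b c h w -> b c', 'mean')         # global average pooling
r = repeat(x, 'b c h w -> b c (h up) w', up=2)  # repeat along height

assert y.shape == (2, 32, 32, 3)
assert m.shape == (2, 3)
assert r.shape == (2, 3, 64, 32)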

Motivation

  • einops is already implemented and works across a number of frameworks

    • it works identically in all supported frameworks
    • the existing codebase is pure Python and has zero dependencies, so it can easily be moved inside frameworks and adjusted if needed
  • einops solves the problem of interface conflicts that data-apis tries to solve (only for some operations, but the covered subset of operations is probably the main source of discrepancies)

    • when moving code from existing frameworks to a "unified API", mistakes are hard to avoid: torch.repeat does the job of numpy.tile, not numpy.repeat; flatten is defined differently across frameworks; sets of axes are passed as tuples or as individual arguments; and there are other discrepancies
    • the similarity of the previous and new APIs would almost surely lead to "migration by substitution"
    • avoiding conflicts with already existing operations is part of einops' design
  • a verbose API has multiple advantages (demonstrated here) and helps other people follow code, which is non-trivial for tensor packages

  • view / reshape are dangerous operations: they can completely break tensor structure, and mistakes made with them are almost impossible to catch (when dealing with just a bunch of numbers in an intermediate activation, there is no way to tell that they are ordered the wrong way).

    • einops replaces these operations, adds checks, and makes it impossible to break tensor structure (see the sketch after this list)
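
A small sketch of the checks mentioned above: the pattern encodes the expected structure, so a mismatch raises an error instead of silently scrambling data (the shapes here are arbitrary):

from einops import rearrange
import numpy as np

x = np.zeros((2, 3, 4))  # only three axes

# A plain reshape would happily produce *something* here;
# einops refuses because the pattern expects four axes.
try:
    rearrange(x, 'b c h w -> b h w c')
except Exception as err:
    print(err)  # an EinopsError explaining the mismatch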

While this looks like a big leap, it is a surprisingly good moment to make it: a move to more reliable and transferable code without deprecating parts of the existing API. In this scenario, the first transition would be made by library developers, which would simplify subsequent wider adoption.

In case of a positive decision, the introduction of new features and operations in einops would first be coordinated with/by the consortium to ensure identical and efficient support by all frameworks.

Protocol for distributed data

In order for an ndarray/dataframe system to interact with a variety of frameworks in a distributed environment (such as clusters of workstations) a stable description of the distribution characteristics is needed.

The __partitioned__ protocol accomplishes this by defining a structure which provides the necessary meta-information. Implementations show that it allows data exchange between different distributed frameworks without unnecessary data transmission or copies.

The structure defines how the data is partitioned and where it is located, and it provides a function to access the local data. It does not define or provide any means of communication, messaging and/or resource management. It merely describes the current distribution state.
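
To make this concrete, below is a hedged, self-contained sketch of what such a structure might look like for a 1-D array split into two partitions. The key names follow the draft proposal as I understand it and should be checked against the actual spec; the futures stand in for framework-specific handles that would normally live on remote nodes:

from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Stand-ins for distributed partitions: in a real system these would be
# handles/futures for data living on different processes or nodes.
pool = ThreadPoolExecutor()
handle0 = pool.submit(np.arange, 0, 4)
handle1 = pool.submit(np.arange, 4, 8)

partitioned = {
    'shape': (8,),             # global shape of the array
    'partition_tiling': (2,),  # partition grid: 2 partitions along axis 0
    'partitions': {
        (0,): {'start': (0,), 'shape': (4,), 'data': handle0, 'location': ['node0']},
        (1,): {'start': (4,), 'shape': (4,), 'data': handle1, 'location': ['node1']},
    },
    'get': lambda handle: handle.result(),  # resolve a handle to local data
}

# A consumer can reassemble (or, better, process in place) without knowing
# which framework produced the data.
chunks = [partitioned['get'](p['data']) for p in partitioned['partitions'].values()]
assert np.array_equal(np.concatenate(chunks), np.arange(8))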

In a way, this is similar to the metadata which dlpack provides for exchanging data within a single node (including the local GPU) - but it does so for data which lives on more than one process/node. It complements mechanisms for intra-node exchange, such as dlpack, __array_interface__ and the like.

The current lack of such a structure typically leads to one of the following scenarios when connecting different frameworks:

  • a data consumer implements dedicated import functionality for every distributed data container it considers important enough. As an example, xgboost_ray implements a variety of data_sources.
    This PR lets xgboost_ray automatically work with any new container that supports __partitioned__ (the extra code for modin DF and Ray/MLDataSet is no longer needed once they support it, too)
  • the user needs to explicitly deal with the distributed nature of the data. This either leads to unnecessary data transfer, or developers need to understand the internals of the data container. In the latter case they are exposed to explicit parallelism/distribution, while often the original intent of the producer was to hide exactly that.
    The implementation here avoids that by entirely hiding the distribution features while still allowing zero-copy data exchange between __partitioned__ containers (exemplified by modin (PR) and HeAT (PR)) and scikit-learn-intelex/daal4py.
  • frameworks like dask and ray wrap established APIs (such as pytorch, xgboost etc.) and ask developers to switch to the new framework/API and to adopt their programming model.

A structure like __partitioned__ enables distributed data exchange between various frameworks, which can be seamless to users while avoiding unnecessary data transfer.

The __partitioned__ protocol is an initial proposal; any feedback from this consortium will be highly appreciated. We would like to build a neutral, open governance model that continues to oversee the evolution of the spec. For example, if some version of it receives a positive response from the community, we would be more than happy to donate the __partitioned__ protocol to the data-api consortium or host it in a clean GitHub org.

Proposal: a high-level test suite to check equivalence of array modules

Hi all - @rsokl and I saw the launch announcement a few days ago, and as frequent array users we're very excited.  

That's not why I'm here though: we want to propose that the Consortium provide a high-level test suite for the interoperable or standardised parts of an array API.

We're core developers of the Hypothesis library for property-based testing (or structured/semantic fuzzing), and have done a lot of work on Numpy support and found a bunch of bugs (with fixes being the limiting factor).  I wrote up some advice as a paper for SciPy 2020.  Key points for the Consortium:

  • checking independent implementations for equivalent behaviour ("differential testing") has found very many bugs in e.g. compilers, and doesn't require a formal spec.  This pattern also makes running the test suite on new modules very easy - see the sketch after this list!
  • randomised testing of even simple properties ("does not crash", "save/load is a no-op", etc) also finds very many bugs - in my experience most test functions I write for Numpy find novel bugs.
  • Hypothesis ships with a rich suite of functions to describe input data, including for Numpy and Pandas, and is already used by {numpy,astropy,pandas,xarray,...}.  This extends to e.g. sets of mutually-broadcastable shapes suitable for a given gufunc signature.
  • We even have a "ghostwriter" to help get started - try e.g. hypothesis write numpy.matmul or hypothesis write --equivalent numpy.add mxnet.add pytorch.add
  • As a proof of concept, @rsokl's autograd library mygrad relies heavily on Hypothesis to check both equivalence with Numpy and correctness of gradients; he credits property-based testing with making the project possible.  Many of mygrad's tests could easily be adapted to form the core of a standard array module test suite - for example, here are all two of the tests for matrix multiplication, testing forward and backprop through any operation on any arrays.
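
To illustrate, here is a hedged sketch of a differential property test in this style. It uses NumPy on both sides so that it is self-contained; in a real cross-module suite the second module would be e.g. cupy or torch:

import numpy as np
from hypothesis import given
from hypothesis.extra import numpy as hnp

# Differential testing: generate arbitrary arrays and assert that two
# implementations of the same operation agree.
module_a, module_b = np, np  # swap module_b for another array library

@given(hnp.arrays(dtype=np.float64, shape=hnp.array_shapes(max_dims=3)))
def test_sum_is_equivalent(x):
    np.testing.assert_array_equal(module_a.sum(x), module_b.sum(x))

test_sum_is_equivalent()  # Hypothesis calls this with many generated inputs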

We're both keen to see the Consortium succeed, and happy to help out with testing if that would be welcome.
