
ginn's Introduction

Ginn: Graph Interface to Neural Networks

A minimalistic, header-only neural net library

CI: cpu (build & test), gpu (build only)

Design goals

Ginn is inspired by dynamic computation graph approaches like PyTorch or DyNet. On top of that, it strives for the following:

  • Ginn and all its dependencies are header only, which makes it easy to install. Still, headers are compartmentalized so you can include only what you need.
  • Minimalism in code. Readability is emphasized by keeping the code small and built from minimally functioning building blocks. The hope is to be less intimidating to newcomers to C++ and easy to extend. (Short) inlined definitions mean everything can be seen in one place, which also makes the style familiar to Python users. Ginn is written with the mindset that a newcomer can quickly pick up the whole framework and figure out its inner workings: every user should be able to become a developer.
  • No centralized flow control. There are no central data structures that track a computation graph; graphs are defined only through nodes being connected (see the sketch after this list). This makes thread safety and multithreaded parallelism quite simple to handle.
  • No separation between a node in a computation graph and its parameters (or data). For example, a WeightNode object both implements its node behavior as well as acts as an (owning) container of the values of its parameters.
  • Small networks or batches are not second-class citizens. Many real-life systems have latency constraints that make good performance for small networks or small batch sizes (even fully online use) important.
  • Simple autobatching. Automatically convert any computation graph to perform batched operations when batched code is nontrivial to write by hand, such as for TreeLstms.
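
To make the "no centralized flow control" point concrete, here is a minimal sketch using only constructs that appear in the snippets below: a graph exists purely through node connections, and Graph is handed the sink and discovers the rest by traversal.

auto x = Random<>(cpu(), {2, 3});  // a leaf node
auto y = Tanh(x + x);              // connecting nodes is all there is to it
Graph g(y);                        // the graph is discovered from the sink y
g.forward();                       // topological traversal, then compute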

Example snippets

Mnist multilayer perceptron

auto [X, Y] = mnist_reader(train_file, bs);
auto [Xt, Yt] = mnist_reader(test_file, bs);

auto W = Weight(dev(), {dimh, dimx});
auto b = Weight(dev(), {dimh});
auto U = Weight(dev(), {dimy, dimh});
auto c = Weight(dev(), {dimy});

std::vector<WeightPtr<Real>> weights = {W, b, U, c};
init::Uniform<Real>().init(weights);
update::Adam<Real> updater(lr);

// Single instance (batch)
auto pass = [&](DataPtr<Real> x, Indices& y, auto& acc, bool train) {
  auto h = Affine<SigmoidOp>(W, x, b);
  auto y_ = Affine(U, h, c);
  auto loss = Sum(PickNegLogSoftmax(y_, y));

  auto graph = Graph(loss);
  graph.forward();
  acc.batched_add(argmax(y_->value(), 0).maybe_copy_to(cpu()), y);

  if (train) {
    graph.reset_grad();
    graph.backward(1.);
    updater.update(weights);
  }
};

std::mt19937 g(seed);

// Single epoch
auto pass_data = [&](auto& X, auto& Y, bool train) {
  metric::Accuracy<Int> acc;
  auto perm = train ? randperm(X.size(), g) : iota(X.size());
  for (size_t j : perm) { pass(X[j], Y[j], acc, train); }
  return 100. * (1. - acc.eval());
};

// Main training loop
std::cout << "TrErr%\tTstErr%\tsecs" << std::endl;
for (size_t e = 0; e < epochs; e++) {
  using namespace ginn::literals;
  timer::tic();
  std::cout << ("{:6.3f}\t"_f, pass_data(X, Y, true)) << std::flush;
  std::cout << ("{:6.3f}\t"_f, pass_data(Xt, Yt, false)) << timer::toc() / 1e6
            << std::endl;
}

Lstm step

template <typename DataPtrPair>
State step(const NodePtr<Scalar>& x, const DataPtrPair& past) {
  auto [h_past, c_past] = past;

  auto i = Affine<SigmoidOp>(Wix, x, Wih, h_past, Wic, c_past, bi);
  auto f = Affine<SigmoidOp>(Wfx, x, Wfh, h_past, Wfc, c_past, bf);
  auto g = Affine<TanhOp>(Wcx, x, Wch, h_past, bc);
  auto c = CwiseProd(f, c_past) + CwiseProd(i, g);
  auto o = Affine<SigmoidOp>(Wox, x, Woh, h_past, Woc, c, bo);
  auto h_ = Tanh(c);
  auto h = CwiseProd(o, h_);

  return {h, c};
}


ginn's Issues

Composing expressions (of unary functions in particular)

  • Should I prefer free functions (current) or member functions, or both (or a mixture of both for different subsets)? For instance, Eigen uses a.cwiseProd(b) instead of CwiseProd(a, b), and PyTorch is similar.
  • If member functions, should it be a member function of Ptr<Node> or of Node?

Some examples for all three:

NodePtr<> a, b;
auto c = CwiseProd(a, b);   // free function (current)
auto c = a.cwise_prod(b);   // member function of Node
auto c = a->cwise_prod(b);  // member function of Ptr<Node>

NodePtr<> a;
auto b = Exp(Sum(Log(a), /*axis*/ 1));      // free functions
auto b = a.log().sum(/*axis*/ 1).exp();     // member functions of Node
auto b = a->log()->sum(/*axis*/ 1)->exp();  // member functions of Ptr<Node>

Considering that we want expressions to be extensible by the user, member functions would restrict us to a fixed, library-supported subset of operations, which is a drawback.

Based on some early discussions, I am leaning towards the third option (chaining by -> with pointer semantics, no changes to Ptr).

Another consideration is that since this is going to be part of the core interface (either Ptr or Node), these core types will need to know about some derived node types, which can lead to wonky forward declarations, as the sketch below shows.
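
As a rough sketch of what the member-function options entail (assuming Ptr is shared_ptr-like; all names beyond those above are hypothetical), the member has to mint a pointer to its own node and call into a derived-node factory, which is where the core/derived coupling shows up:

#include <memory>

template <typename Scalar>
class Node : public std::enable_shared_from_this<Node<Scalar>> {
 public:
  // CwiseProd constructs a derived node type, so this core header now has
  // to see (or forward declare) every such factory -- the wonky part.
  NodePtr<Scalar> cwise_prod(const NodePtr<Scalar>& other) {
    return CwiseProd(this->shared_from_this(), other);
  }
};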

GPU-enabled CPU runs are slow

If I build mnist.cu.cpp with -DGINN_ENABLE_GPU=0 and run, a single epoch takes ~3.5s. If I build it with -DGINN_ENABLE_GPU=1 and run on the same docker instance (no GPU, falling back to CPU), a single epoch takes ~30s.

Could be:

  • Optimizer flags are not properly set / sufficient
    • Although, I verified with a verbose build that ptxas and compiler options are set to at least -O3; is there anything else missing?
  • nvcc is doing a poor job somehow
    • Maybe test using CUDA > 11.1

Investigate meson

And consider deprecating CMake if meson proves good. Important considerations are CUDA and pybind11.

Investigate single header only option

This is a bit ambitious, but there is no harm in trying it out. It could bundle the necessary bits from Eigen so that the user doesn't even have to deal with Eigen inclusions.

Investigate removing Node::set_ins()

This was added to make some operations during autobatching possible -- it seems like too big an interface change just to enable that. Investigate whether it's possible to get rid of it by adding one more indirection layer to ginn::Ptr (sketched below), and benchmark the timing.
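
A very rough sketch of the indirection idea, with every detail assumed: if Ptr owns a slot that in turn points to the node, autobatching could repoint the slot instead of mutating a node's input list.

#include <memory>

template <typename NodeType>
class Ptr {
 public:
  explicit Ptr(std::shared_ptr<NodeType> n)
      : slot_(std::make_shared<std::shared_ptr<NodeType>>(std::move(n))) {}
  NodeType* operator->() const { return slot_->get(); }
  // autobatching swaps the pointee; every copy of this Ptr observes the
  // change, so nothing needs to mutate a node's inputs via set_ins()
  void repoint(std::shared_ptr<NodeType> n) { *slot_ = std::move(n); }
 private:
  std::shared_ptr<std::shared_ptr<NodeType>> slot_;
};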

AddScalar and ProdScalar with scalar output nodes

Currently, scalar operators expect actual scalars (e.g. Reals). However, they should also allow size-one (or maybe strictly rank-zero) tensor nodes coming from another computation.

E.g.:

NodePtr<> x, y;
auto z = Sum(x) * y;

Here Sum(x) is a size-one tensor (possibly rank 0) and it is meaningful to use it as a multiplication operand, but there is no mechanism for this; one possible shape is sketched below.
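
One possible shape for such a mechanism, entirely hypothetical:

// Hypothetical overload; ProdScalar currently takes a raw scalar instead.
template <typename Scalar>
NodePtr<Scalar> operator*(const NodePtr<Scalar>& x, const NodePtr<Scalar>& s);
// would construct a ProdScalar-like node that reads the rank-zero value
// s->value() at forward() time and backprops into both x and s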

Add detailed error messages (e.g. dimension mismatch)

Should include which objects mismatch and their shapes.

My first thought was to leave this to Eigen itself, since you do get errors on mismatches; but those are hard assertions, so adding our own messages makes sense. However, until you call forward you don't know the shapes of the inputs, so you won't see the error at the line where you define the node. Is that okay?

The alternative is to leave this to Eigen and switch Eigen assertions to exceptions, which we already do for CPU-only builds. For GPUs, however, exceptions break nvcc (or at least they used to).
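
For reference, a minimal sketch of what a friendlier check could look like, assuming hypothetical in(i) and shape() accessors and a to_string() helper for shapes:

// run inside a binary node's forward(), before any computation:
if (in(0)->shape() != in(1)->shape()) {
  throw std::runtime_error("CwiseProd: shape mismatch: " +
                           to_string(in(0)->shape()) + " vs " +
                           to_string(in(1)->shape()));
}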

Consider redesigning operator=() for tensors

Consider having operator= copy the device as well, i.e. perform a proper deep copy, to avoid confusion. Maybe having only the special ctor, move_to(), and maybe_copy_to() is enough for transferring tensors across devices. The current behavior just seems confusing.
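
To make the confusion concrete (constructor shape and gpu() assumed here; the comment states the proposal, not current behavior):

Tensor<Real> a(gpu(), {2, 3});
Tensor<Real> b(cpu(), {2, 3});
b = a;  // proposed: b would now live on gpu() and hold a deep copy of a,
        // rather than copying a's values onto b's existing device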

Determine semantics of "grad"/"backward" for integral scalar types

What does it mean to compute "gradients" for nodes that have Int or bool scalar types? Several choices:

  • All gradients are zero -- may be mathematically correct (w.r.t. some assumptions I guess -- what is an infinitesimal change in integers?) but wasteful?
  • ::backward() is disabled altogether and has_grad is always false -- this would require template specialization and maybe SFINAE, as sketched after this list. Another option is to omit the SFINAE, so that ::backward() still exists and is callable but throws at runtime.
  • ::backward() works as if everything were floating point, and results become whatever they turn out to be when cast back to ints -- this is the "no code change" case where I just instantiate the node with the scalar and assume everything works (some things certainly will not). This is semantically similar to integer division of two ints: not mathematically correct w.r.t. real division, but approximately correct and somewhat reasonable.
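
A sketch of the second option, with names assumed and details elided:

#include <type_traits>

template <typename Scalar>
class Node {
 public:
  // backward() only instantiates when Scalar is a floating point type
  template <typename S = Scalar,
            std::enable_if_t<std::is_floating_point_v<S>, int> = 0>
  void backward() { /* propagate gradients */ }

  bool has_grad() const { return std::is_floating_point_v<Scalar>; }
};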

Check devices of inputs when calling forward

Maybe other checks as well, such as the inputs having been forwarded.

It might be worth moving everything I have into a private virtual forward_() method and having a public, non-virtual forward() that runs these checks and then calls forward_(), as sketched below. Would there be very niche Nodes that might want to disable these checks, though?
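
This is the non-virtual interface idiom; a sketch with assumed accessors (ins(), dev(), forwarded()):

class BaseNode {
 public:
  void forward() {                      // public and non-virtual
    for (auto& in : ins()) {
      if (in->dev() != dev()) { throw std::runtime_error("device mismatch"); }
      if (!in->forwarded()) { throw std::runtime_error("input not forwarded"); }
    }
    forward_();                         // the actual work, per derived node
  }
 private:
  virtual void forward_() = 0;
};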

Separate Backward Graph

Consider some expression like this:

NodePtr<> x = Random<>(cpu(), {2, 3});
for (size_t i = 0; i < 3; i++) {
  x = x + x;
}
auto y = LessThan(x, Random<>(cpu(), {2, 3}));

Graph g(y);
g.forward();
g.reset_grad();
g.backward(1.);

Graph currently constructs the entire graph and traverses it in topological order, which is fine for forward. However, when doing backward we don't need the whole graph: y won't backprop any nonzero gradients, which means anything upstream is not getting any gradients either. Therefore the call to backward unnecessarily loops over all the nodes. This is true even when has_grad is set to false for LessThanNode, because the AddNodes still have has_grad set to true (which is the right behavior, since x can feed into something else later).

A possible solution is a separate backward graph that contains only what is reachable from the sink going backwards, through nonzero-gradient connections only; a pruning sketch follows. I am not sure how expensive this would be, whether it should be optional or the default, etc.
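
A sketch of such a pruned collection, with BaseNodePtr and the ins()/has_grad() accessors assumed:

#include <unordered_set>
#include <vector>

void collect_backward(const BaseNodePtr& n,
                      std::unordered_set<BaseNode*>& seen,
                      std::vector<BaseNodePtr>& order) {
  // a node with has_grad == false will not pass gradients upstream, so the
  // walk stops here; in the example above it stops right at the sink y
  if (!n->has_grad() or seen.count(n.get()) > 0) { return; }
  seen.insert(n.get());
  for (auto& in : n->ins()) { collect_backward(in, seen, order); }
  order.push_back(n);  // backward() then visits `order` in reverse
}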

Make headers self-includable

Some headers are missing the includes necessary for them to be includable on their own, in isolation. Clang-tidy might help.

Investigate possibility of an eager mode

Maybe automatically forward() an eager node at the time of its construction.

If we assume inductively that its inputs are eager (which should be necessary for its own eagerness), then they are already forwarded; therefore we can forward the downstream node safely without running any graph traversal.

The notion of eagerness can be derived automatically from the eagerness of all inputs, similar to how devices are derived from input node devices based on precedence. A construction sketch follows.
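
A minimal sketch of construction-time forwarding, with all names assumed (nodes exposing ins(), eager(), and a node-local forward()):

#include <memory>
#include <utility>

template <typename NodeType, typename... Args>
auto make_eager(Args&&... args) {
  auto n = std::make_shared<NodeType>(std::forward<Args>(args)...);
  bool eager = true;                    // eagerness derived from the inputs
  for (auto& in : n->ins()) { eager = eager && in->eager(); }
  // eager inputs are already forwarded, so no graph traversal is needed
  if (eager) { n->forward(); }
  return n;
}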
