
ginn's Introduction

Ginn: Graph Interface to Neural Networks

A minimalistic, header-only neural net library

CI: cpu (build & test), gpu (build only)

Design goals

Ginn is inspired by dynamic computation graph approaches like PyTorch or DyNet. On top of that, it strives for the following:

  • Ginn and all its dependencies are header only, which makes it easy to install. Still, headers are compartmentalized so you can include only what you need.
  • Minimalism in code. Readability is emphasized by keeping the code small and built from minimally functioning building blocks. The hope is to be less intimidating to newcomers to C++ and easy to extend. (Short) inlined definitions mean everything can be seen in one place, which also makes the style familiar to Python users. Ginn is written with the mindset that a newcomer can quickly pick up the whole framework and figure out its inner workings: every user should be able to become a developer.
  • No centralized flow control. There are no central data structures that track a computation graph; graphs are defined only through nodes being connected (see the sketch after this list). This makes thread safety and multithreaded parallelism quite simple to handle.
  • No separation between a node in a computation graph and its parameters (or data). For example, a WeightNode object both implements its node behavior as well as acts as an (owning) container of the values of its parameters.
  • Small networks or batches are not second-class citizens. Many real-life systems have latency constraints that make good performance for small networks or small batch sizes (even fully online use) important.
  • Simple autobatching. Automatically convert any computation graph to perform batched operations when batched code is nontrivial to write by hand, such as for TreeLstms.
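
To make the "no centralized flow control" point concrete, here is a minimal sketch using only constructs that appear in the snippets below: a graph exists purely through node connections, and Graph is handed the sink and discovers the rest by traversal.

auto x = Random<>(cpu(), {2, 3});  // a leaf node
auto y = Tanh(x + x);              // connecting nodes is all there is to it
Graph g(y);                        // the graph is discovered from the sink y
g.forward();                       // topological traversal, then compute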

Example snippets

Mnist multilayer perceptron

auto [X, Y] = mnist_reader(train_file, bs);
auto [Xt, Yt] = mnist_reader(test_file, bs);

auto W = Weight(dev(), {dimh, dimx});
auto b = Weight(dev(), {dimh});
auto U = Weight(dev(), {dimy, dimh});
auto c = Weight(dev(), {dimy});

std::vector<WeightPtr<Real>> weights = {W, b, U, c};
init::Uniform<Real>().init(weights);
update::Adam<Real> updater(lr);

// Single instance (batch)
auto pass = [&](DataPtr<Real> x, Indices& y, auto& acc, bool train) {
  auto h = Affine<SigmoidOp>(W, x, b);
  auto y_ = Affine(U, h, c);
  auto loss = Sum(PickNegLogSoftmax(y_, y));

  auto graph = Graph(loss);
  graph.forward();
  acc.batched_add(argmax(y_->value(), 0).maybe_copy_to(cpu()), y);

  if (train) {
    graph.reset_grad();
    graph.backward(1.);
    updater.update(weights);
  }
};

std::mt19937 g(seed);

// Single epoch
auto pass_data = [&](auto& X, auto& Y, bool train) {
  metric::Accuracy<Int> acc;
  auto perm = train ? randperm(X.size(), g) : iota(X.size());
  for (size_t j : perm) { pass(X[j], Y[j], acc, train); }
  return 100. * (1. - acc.eval());
};

// Main training loop
std::cout << "TrErr%\tTstErr%\tsecs" << std::endl;
for (size_t e = 0; e < epochs; e++) {
  using namespace ginn::literals;
  timer::tic();
  std::cout << ("{:6.3f}\t"_f, pass_data(X, Y, true)) << std::flush;
  std::cout << ("{:6.3f}\t"_f, pass_data(Xt, Yt, false)) << timer::toc() / 1e6
            << std::endl;
}

Lstm step

template <typename DataPtrPair>
State step(const NodePtr<Scalar>& x, const DataPtrPair& past) {
  auto [h_past, c_past] = past;

  auto i = Affine<SigmoidOp>(Wix, x, Wih, h_past, Wic, c_past, bi);
  auto f = Affine<SigmoidOp>(Wfx, x, Wfh, h_past, Wfc, c_past, bf);
  auto g = Affine<TanhOp>(Wcx, x, Wch, h_past, bc);
  auto c = CwiseProd(f, c_past) + CwiseProd(i, g);
  auto o = Affine<SigmoidOp>(Wox, x, Woh, h_past, Woc, c, bo);
  auto h_ = Tanh(c);
  auto h = CwiseProd(o, h_);

  return {h, c};
}


ginn's Issues

Composing expressions (of unary functions in particular)

  • Should I prefer free functions (current) or member functions, or both (or a mixture of both for different subsets)? For instance, Eigen uses a.cwiseProd(b) instead of CwiseProd(a, b), and PyTorch is similar.
  • If member functions, should it be a member function of Ptr<Node> or of Node?

Some examples for all three:

NodePtr<> a, b;
auto c = CwiseProd(a, b);   // free function (current)
auto c = a.cwise_prod(b);   // member function of Node
auto c = a->cwise_prod(b);  // member function of Ptr<Node>

NodePtr<> a;
auto b = Exp(Sum(Log(a), /*axis*/ 1));      // free functions
auto b = a.log().sum(/*axis*/ 1).exp();     // member functions of Node
auto b = a->log()->sum(/*axis*/ 1)->exp();  // member functions of Ptr<Node>

Considering that we want expressions to be extensible by the user, member functions would restrict us to a fixed, library-supported subset of operations, which is a drawback.

Based on some early discussions, I am leaning towards the third option (chaining by -> with pointer semantics, no changes to Ptr).

Another consideration is that since this is going to be part of the core interface (either Ptr or Node), these core types will need to know about some derived node types, which can lead to wonky forward declarations, as the sketch below shows.
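
As a rough sketch of what the member-function options entail (assuming Ptr is shared_ptr-like; all names beyond those above are hypothetical), the member has to mint a pointer to its own node and call into a derived-node factory, which is where the core/derived coupling shows up:

#include <memory>

template <typename Scalar>
class Node : public std::enable_shared_from_this<Node<Scalar>> {
 public:
  // CwiseProd constructs a derived node type, so this core header now has
  // to see (or forward declare) every such factory -- the wonky part.
  NodePtr<Scalar> cwise_prod(const NodePtr<Scalar>& other) {
    return CwiseProd(this->shared_from_this(), other);
  }
};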

GPU-enabled CPU runs are slow

If I build mnist.cu.cpp with -DGINN_ENABLE_GPU=0 and run, a single epoch takes ~3.5s. If I build it with -DGINN_ENABLE_GPU=1 and run on the same docker instance (no GPU, falling back to CPU), a single epoch takes ~30s.

Could be:

  • Optimizer flags are not properly set / sufficient
    • Although, I verified with a verbose build that ptxas and compiler options are set to at least -O3; is there anything else missing?
  • nvcc is doing a poor job somehow
    • Maybe test using CUDA > 11.1

Investigate meson

And consider deprecating CMake if meson proves good. Important considerations are CUDA and pybind11.

Investigate single header only option

This is a bit ambitious, but there is no harm in trying it out. It could bundle the necessary bits from Eigen so that the user doesn't even have to deal with Eigen inclusions.

Investigate removing Node::set_ins()

This was added to make some operations during autobatching possible -- it seems like too big an interface change just to enable that. Investigate whether it's possible to get rid of it by adding one more indirection layer to ginn::Ptr (sketched below), and benchmark the timing.
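
A very rough sketch of the indirection idea, with every detail assumed: if Ptr owns a slot that in turn points to the node, autobatching could repoint the slot instead of mutating a node's input list.

#include <memory>

template <typename NodeType>
class Ptr {
 public:
  explicit Ptr(std::shared_ptr<NodeType> n)
      : slot_(std::make_shared<std::shared_ptr<NodeType>>(std::move(n))) {}
  NodeType* operator->() const { return slot_->get(); }
  // autobatching swaps the pointee; every copy of this Ptr observes the
  // change, so nothing needs to mutate a node's inputs via set_ins()
  void repoint(std::shared_ptr<NodeType> n) { *slot_ = std::move(n); }
 private:
  std::shared_ptr<std::shared_ptr<NodeType>> slot_;
};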

AddScalar and ProdScalar with scalar output nodes

Currently, scalar operators expect actual scalars (e.g. Reals). However, they should also allow size-one (or maybe strictly rank-zero) tensor nodes coming from another computation.

E.g.:

NodePtr<> x, y;
auto z = Sum(x) * y;

Here Sum(x) is a size-one tensor (possibly rank 0) and it is meaningful to use it as a multiplication operand, but there is no mechanism for this; one possible shape is sketched below.
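
One possible shape for such a mechanism, entirely hypothetical:

// Hypothetical overload; ProdScalar currently takes a raw scalar instead.
template <typename Scalar>
NodePtr<Scalar> operator*(const NodePtr<Scalar>& x, const NodePtr<Scalar>& s);
// would construct a ProdScalar-like node that reads the rank-zero value
// s->value() at forward() time and backprops into both x and s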

Add detailed error messages (e.g. dimension mismatch)

Should include which objects mismatch and their shapes.

My first thought was to leave this to Eigen itself, since you do get errors on mismatches; but those are hard assertions, so adding our own messages makes sense. However, until you call forward you don't know the shapes of the inputs, so you won't see the error at the line where you define the node. Is that okay?

The alternative is to leave this to Eigen and switch Eigen assertions to exceptions, which we already do for CPU-only builds. For GPUs, however, exceptions break nvcc (or at least they used to).
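
For reference, a minimal sketch of what a friendlier check could look like, assuming hypothetical in(i) and shape() accessors and a to_string() helper for shapes:

// run inside a binary node's forward(), before any computation:
if (in(0)->shape() != in(1)->shape()) {
  throw std::runtime_error("CwiseProd: shape mismatch: " +
                           to_string(in(0)->shape()) + " vs " +
                           to_string(in(1)->shape()));
}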

Consider redesigning operator=() for tensors

Consider having operator= copy the device as well, i.e. perform a proper deep copy, to avoid confusion. Maybe having only the special ctor, move_to(), and maybe_copy_to() is enough for transferring tensors across devices. The current behavior just seems confusing.
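
To make the confusion concrete (constructor shape and gpu() assumed here; the comment states the proposal, not current behavior):

Tensor<Real> a(gpu(), {2, 3});
Tensor<Real> b(cpu(), {2, 3});
b = a;  // proposed: b would now live on gpu() and hold a deep copy of a,
        // rather than copying a's values onto b's existing device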

Determine semantics of "grad"/"backward" for integral scalar types

What does it mean to compute "gradients" for nodes that have Int or bool scalar types? Several choices:

  • All gradients are zero -- may be mathematically correct (w.r.t. some assumptions I guess -- what is an infinitesimal change in integers?) but wasteful?
  • ::backward() is disabled altogether and has_grad is always false -- this would require template specialization and maybe SFINAE, as sketched after this list. Another option is to omit the SFINAE, so that ::backward() still exists and is callable but throws at runtime.
  • ::backward() works as if everything were floating point, and results become whatever they turn out to be when cast back to ints -- this is the "no code change" case where I just instantiate the node with the scalar and assume everything works (some things certainly will not). This is semantically similar to integer division of two ints: not mathematically correct w.r.t. real division, but approximately correct and somewhat reasonable.
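
A sketch of the second option, with names assumed and details elided:

#include <type_traits>

template <typename Scalar>
class Node {
 public:
  // backward() only instantiates when Scalar is a floating point type
  template <typename S = Scalar,
            std::enable_if_t<std::is_floating_point_v<S>, int> = 0>
  void backward() { /* propagate gradients */ }

  bool has_grad() const { return std::is_floating_point_v<Scalar>; }
};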

Check devices of inputs when calling forward

Maybe other checks as well, such as the inputs having been forwarded.

It might be worth moving everything I have into a private virtual forward_() method and having a public, non-virtual forward() that runs these checks and then calls forward_(), as sketched below. Would there be very niche Nodes that might want to disable these checks, though?
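
This is the non-virtual interface idiom; a sketch with assumed accessors (ins(), dev(), forwarded()):

class BaseNode {
 public:
  void forward() {                      // public and non-virtual
    for (auto& in : ins()) {
      if (in->dev() != dev()) { throw std::runtime_error("device mismatch"); }
      if (!in->forwarded()) { throw std::runtime_error("input not forwarded"); }
    }
    forward_();                         // the actual work, per derived node
  }
 private:
  virtual void forward_() = 0;
};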

Separate Backward Graph

Consider some expression like this:

NodePtr<> x = Random<>(cpu(), {2, 3});
for (size_t i = 0; i < 3; i++) {
  x = x + x;
}
auto y = LessThan(x, Random<>(cpu(), {2, 3}));

Graph g(y);
g.forward();
g.reset_grad();
g.backward(1.);

Graph currently constructs the entire graph and traverses it in topological order, which is fine for forward. However, when doing backward we don't need the whole graph: y won't backprop any nonzero gradients, which means anything upstream is not getting any gradients either. Therefore the call to backward unnecessarily loops over all the nodes. This is true even when has_grad is set to false for LessThanNode, because the AddNodes still have has_grad set to true (which is the right behavior, since x can feed into something else later).

A possible solution is a separate backward graph that contains only what is reachable from the sink going backwards, through nonzero-gradient connections only; a pruning sketch follows. I am not sure how expensive this would be, whether it should be optional or the default, etc.
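
A sketch of such a pruned collection, with BaseNodePtr and the ins()/has_grad() accessors assumed:

#include <unordered_set>
#include <vector>

void collect_backward(const BaseNodePtr& n,
                      std::unordered_set<BaseNode*>& seen,
                      std::vector<BaseNodePtr>& order) {
  // a node with has_grad == false will not pass gradients upstream, so the
  // walk stops here; in the example above it stops right at the sink y
  if (!n->has_grad() or seen.count(n.get()) > 0) { return; }
  seen.insert(n.get());
  for (auto& in : n->ins()) { collect_backward(in, seen, order); }
  order.push_back(n);  // backward() then visits `order` in reverse
}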

Make headers self-includable

Some headers are missing the includes necessary for them to be includable on their own, in isolation. Clang-tidy might help.

Investigate possibility of an eager mode

Maybe automatically forward() an eager node at the time of its construction.

If we assume inductively that its inputs are eager (which should be necessary for its own eagerness), then they are already forwarded; therefore we can forward the downstream node safely without running any graph traversal.

The notion of eagerness can be derived automatically from the eagerness of all inputs, similar to how devices are derived from input node devices based on precedence. A construction sketch follows.
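
A minimal sketch of construction-time forwarding, with all names assumed (nodes exposing ins(), eager(), and a node-local forward()):

#include <memory>
#include <utility>

template <typename NodeType, typename... Args>
auto make_eager(Args&&... args) {
  auto n = std::make_shared<NodeType>(std::forward<Args>(args)...);
  bool eager = true;                    // eagerness derived from the inputs
  for (auto& in : n->ins()) { eager = eager && in->eager(); }
  // eager inputs are already forwarded, so no graph traversal is needed
  if (eager) { n->forward(); }
  return n;
}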
