Fast, flexible and fun neural networks.

brainstorm's Introduction

Brainstorm

Discontinuation Notice
Brainstorm is no longer being maintained, so we recommend using one of the many other available frameworks, such as Tensorflow or Chainer. These and similar large projects are supported much more actively by a larger number of contributors. They provide, or plan to provide, many of brainstorm's available and planned features, and have several advantages, particularly in speed. In order to avoid fragmentation and waste of effort, we have decided to discontinue the brainstorm project and contribute to other frameworks and related projects such as Sacred instead. Many thanks to everyone who contributed! For us it has been a thoroughly enjoyable and educational experience.

Brainstorm makes working with neural networks fast, flexible and fun.

Combining lessons from previous projects with new design elements, and written entirely in Python, Brainstorm has been designed to work on multiple platforms with multiple computing backends.

Getting Started

A good point to start is the brief walkthrough of the cifar10_cnn.py example.
More documentation is in progress, and hosted on ReadTheDocs. If you wish, you can also run the data preparation scripts (data directory) and look at some basic examples (examples directory).

Status

Brainstorm is discontinued.

The currently available feature set includes recurrent (simple, LSTM, Clockwork), 2D convolution/pooling, Highway and batch normalization layers. API documentation is fairly complete and we are currently working on tutorials and usage guides.

Brainstorm abstracts computations via handlers with a consistent API. Currently, two handlers are provided: NumpyHandler for computations on the CPU (through Numpy/Cython) and PyCudaHandler for the GPU (through PyCUDA and scikit-cuda).

Installation

Here are some quick instructions for installing the latest master branch on Ubuntu.

# Install pre-requisites
sudo apt-get update
sudo apt-get install python-dev libhdf5-dev git python-pip
# Get brainstorm
git clone https://github.com/IDSIA/brainstorm
# Install
cd brainstorm
[sudo] pip install -r requirements.txt
[sudo] python setup.py install
# Build local documentation (optional)
sudo apt-get install python-sphinx
make docs
# Install visualization dependencies (optional)
sudo apt-get install graphviz libgraphviz-dev pkg-config
[sudo] pip install pygraphviz --install-option="--include-path=/usr/include/graphviz" --install-option="--library-path=/usr/lib/graphviz/"

To use your CUDA installation with brainstorm:

$ [sudo] pip install -r pycuda_requirements.txt

Set location for storing datasets:

echo "export BRAINSTORM_DATA_DIR=/home/my_data_dir/" >> ~/.bashrc

Help and Support

If you have any suggestions or questions, please post to the Google group.

If you encounter any errors or problems, please let us know by opening an issue.

License

MIT License. Please see the LICENSE file.

Acknowledgements and Citation

Klaus Greff and Rupesh Srivastava would like to thank Jürgen Schmidhuber for his continuous supervision and encouragement. Funding from EU projects NASCENCE (FP7-ICT-317662) and WAY (FP7-ICT-288551) was instrumental during the development of this project. We also thank Nvidia Corporation for their donation of GPUs.

If you use Brainstorm in your research, please cite us as follows:

Klaus Greff, Rupesh Kumar Srivastava and Jürgen Schmidhuber. 2016. Brainstorm: Fast, Flexible and Fun Neural Networks, Version 0.5. https://github.com/IDSIA/brainstorm

Bibtex:

@misc{brainstorm2015,
  author = {Klaus Greff and Rupesh Kumar Srivastava and Jürgen Schmidhuber},
  title = {{Brainstorm: Fast, Flexible and Fun Neural Networks, Version 0.5}},
  year = {2015},
  url = {https://github.com/IDSIA/brainstorm}
}

brainstorm's Issues

Changing the conv memory layout

We just briefly discussed this and I think it would be good to change the buffer layout for our convolutional layers from 1 to 2:

1. (time, batch_size, channels, height, width)
2. (time, batch_size, height, width, channels)

Pros

Batch normalization for conv layers would be easier, because there you want to estimate the mean/std over time, batch_size, height, and width, but not over channels. With layout 2 that takes a single flatten operation.
When writing a 2D-RNN or 2D-LSTM implementation, it would probably be possible to do the matrix multiplication for the recurrent matrices with a single BLAS call over all images in the batch, because only a single striding is needed, not two.
It makes more sense from a memory layout perspective, because each filter of a convolution uses all channels but not the entire width/height.
OpenCV and apparently caffe2 use that format

Cons

Caffe uses format 1
We need to change our code :-(
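
To make the batch-normalization argument concrete, here is a minimal NumPy sketch (not brainstorm code; the sizes are made up for illustration). With channels last, the per-channel statistics come from one reshape that is a view on the buffer, whereas the channels-first layout needs a reduction over a tuple of non-adjacent axes.

```python
import numpy as np

rng = np.random.default_rng(0)
T, B, C, H, W = 2, 3, 4, 5, 6  # illustrative sizes only

# Layout 2: (time, batch_size, height, width, channels)
x_last = rng.standard_normal((T, B, H, W, C))
# Collapsing all leading axes is a view (no copy) on a C-contiguous array.
flat = x_last.reshape(-1, C)
mean_last = flat.mean(axis=0)
std_last = flat.std(axis=0)

# Layout 1: (time, batch_size, channels, height, width)
x_first = np.ascontiguousarray(np.moveaxis(x_last, -1, 2))
# The channel axis sits in the middle, so we reduce over four separate axes.
mean_first = x_first.mean(axis=(0, 1, 3, 4))
std_first = x_first.std(axis=(0, 1, 3, 4))
```

Both variants give the same numbers; the difference is only in how naturally the reduction maps onto a single 2D operation.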

FullyConnectedLayer forward pass fails

Because the internals.Ha view is now a sliced view, numpy.dot complains

 -->  H.dot(flat_input, WX, flat_Ha)
ValueError: output array is not acceptable (must have the right type, nr dimensions, and be a C-Array)

This is a bit annoying. We might be able to solve it by manually iterating over time.
Or (for some cases including this one) the problem would vanish if we switch to hub-wise memory layout again...
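
For reference, the failure can be reproduced outside brainstorm with plain NumPy. The sketch below assumes a hypothetical buffer in which the layer owns only the first few feature columns, so its view is not C-contiguous. It also shows a simple workaround (compute into a temporary, then copy), which is an assumption on my side and not necessarily the fix we want, since it costs an extra allocation per call.

```python
import numpy as np

T, B, n_in, n_out, n_total = 3, 4, 5, 2, 6  # illustrative sizes only
rng = np.random.default_rng(0)
x = rng.standard_normal((T * B, n_in))
w = rng.standard_normal((n_in, n_out))

# One shared buffer; the layer only owns the first n_out feature columns.
buffer = np.zeros((T * B, n_total))
ha_view = buffer[:, :n_out]  # sliced view, not C-contiguous

try:
    np.dot(x, w, out=ha_view)  # numpy.dot requires a C-contiguous `out`
    dot_with_out_failed = False
except ValueError:
    dot_with_out_failed = True

# Workaround: let numpy.dot allocate the result, then copy it into the view.
ha_view[...] = np.dot(x, w)
```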

Make Hooks more independent

If we remove monitor_kwargs from the Trainer, and start() from the hooks, it will become very simple to create and call a standalone hook without using a trainer. This seems to be useful/required.

Some alterations to the call signature for hooks might be needed to make things less awkward.

Any objections to this?

Adjust layer internal state sensitivity tests to exclude context slices

Benchmarking and Profiling

I think we should run some rudimentary profiling before the initial release, just to get rid of the worst performance offenders.

On a related note, it'd be nice to have an automatic benchmarking script, that times all handler operations and layer passes. That way it would be easy to measure the effect of speeding up handler operations and layer implementations.
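
A rough idea of what such a script could look like, using only `timeit` and NumPy. The operation names and the dictionary of callables are placeholders; a real version would call the handler methods and layer passes instead of NumPy directly.

```python
import timeit
import numpy as np

def benchmark(operations, repeat=3, number=20):
    """Return the best per-call time in seconds for each named callable."""
    results = {}
    for name, fn in operations.items():
        timings = timeit.repeat(fn, repeat=repeat, number=number)
        results[name] = min(timings) / number
    return results

a = np.random.rand(64, 64)
b = np.random.rand(64, 64)
out = np.empty((64, 64))

# Placeholder operations; swap in real handler calls here.
operations = {
    "dot_mm": lambda: np.dot(a, b, out=out),
    "elem_add_tt": lambda: np.add(a, b, out=out),
}

results = benchmark(operations)
for name in sorted(results):
    print("%-12s %.2e s/call" % (name, results[name]))
```

Taking the minimum over repeats is the usual way to reduce noise from other processes.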

Architecture handling should be unified

python interface => description
description => architecture datastructure (arch)
arch => validation, cycle checking, and sorting
arch => buffer layout
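
The validation, cycle checking, and sorting step could be sketched as a standard topological sort (Kahn's algorithm). The dictionary format used for `arch` here is a simplified stand-in, not the actual architecture data structure.

```python
from collections import deque

def sort_architecture(arch):
    """Return layer names in topological order; raise ValueError on a cycle.

    `arch` maps each layer name to the list of layers it feeds into
    (a hypothetical, simplified architecture description).
    """
    in_degree = {name: 0 for name in arch}
    for name, targets in arch.items():
        for target in targets:
            if target not in arch:
                raise ValueError("unknown layer: %r" % target)
            in_degree[target] += 1

    # Start with layers that have no incoming connections.
    ready = deque(sorted(n for n, d in in_degree.items() if d == 0))
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for target in arch[name]:
            in_degree[target] -= 1
            if in_degree[target] == 0:
                ready.append(target)

    # Layers that never became ready are part of a cycle.
    if len(order) != len(arch):
        raise ValueError("architecture contains a cycle")
    return order

arch = {"Input": ["Hidden"], "Hidden": ["Output"], "Output": []}
order = sort_architecture(arch)
```

The sorted order is also what the buffer layout step would consume.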

CUDA convolutional/pooling layers

Addition of convolution/pooling layers will significantly add capabilities to the library. The base has been designed keeping this in mind, so having these is a pre-alpha goal.

mnist_lstm example fails

Using PyCudaHandler, I get:

Traceback (most recent call last):
  File "mnist_lstm.py", line 85, in <module>
    trainer.train(network, train_getter, valid_getter=valid_getter)
  File "build/bdist.linux-x86_64/egg/brainstorm/training/trainer.py", line 50, in train
  File "build/bdist.linux-x86_64/egg/brainstorm/training/trainer.py", line 126, in _emit_monitoring
  File "build/bdist.linux-x86_64/egg/brainstorm/training/trainer.py", line 136, in _call_monitor
  File "build/bdist.linux-x86_64/egg/brainstorm/training/monitors.py", line 289, in __call__
  File "build/bdist.linux-x86_64/egg/brainstorm/training/trainer.py", line 169, in run_network
  File "build/bdist.linux-x86_64/egg/brainstorm/structure/network.py", line 62, in provide_external_data
  File "build/bdist.linux-x86_64/egg/brainstorm/handlers/pycuda_handler.py", line 49, in copy_to
pycuda._driver.LogicError: cuMemcpyDtoD failed: invalid argument in <brainstorm.training.monitors.MonitorAccuracy object at 0x7fea2cf61050>

@untom, perhaps you have an idea about what might be causing this?

Better Error message for wrong layer name in hooks

Clarify Data Iterator behaviors

Behavior such as nesting will depend on the type of data iterators. Currently, we can nest all available iterators such that the innermost ones are Online, Minibatches or Undivided. This works since all iterators work with Numpy arrays.

Once we have database iterators, things may change: the data attribute of the iterator may not contain named Numpy arrays as it currently does, since the entire dataset can not be held in memory. For such settings, we may choose to generalize iterators (so that they change behavior based on data type), or implement a separate set of iterators which can not be mixed. For example, we can have NumpyDataIterators and DatabaseIterators.

The best way forward might become clearer once we start working with larger datasets stored in databases/files.
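
As a point of reference for the discussion, this is roughly what a NumPy-only minibatch iterator does. It is a simplified sketch, not the existing Minibatches class; it assumes arrays shaped (time, batch, features), and the array names are just examples.

```python
import numpy as np

def minibatches(data, batch_size):
    """Yield dicts of named arrays, sliced along the batch axis (axis 1).

    `data` maps names to arrays shaped (time, batch, features); all arrays
    must agree on the batch dimension.
    """
    sizes = {arr.shape[1] for arr in data.values()}
    if len(sizes) != 1:
        raise ValueError("all arrays need the same number of sequences")
    n = sizes.pop()
    for start in range(0, n, batch_size):
        yield {name: arr[:, start:start + batch_size]
               for name, arr in data.items()}

data = {"default": np.zeros((5, 10, 3)), "targets": np.zeros((5, 10, 1))}
batches = list(minibatches(data, batch_size=4))
```

A database-backed iterator could expose the same generator interface while fetching each slice lazily, which is one argument for keeping the two families separate rather than mixing them.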

Standardize parameter structure

I think we should (or need to) rearrange the axes of the weight matrices. The convention should be that the first axis always corresponds to the number of units/neurons in the layer. This will:

Be compatible with parameter structures in Caffe etc.
Simplify code for writing weight modifiers etc.

This does not require too much work yet. If it sounds good I can do it.
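
A small NumPy illustration of the proposed convention (sizes and the max-norm modifier are made up for the example): with units on the first axis, each row holds one neuron's incoming weights, so per-unit operations become simple row-wise reductions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_units, batch = 3, 5, 4  # illustrative sizes only

# Proposed convention: first axis indexes the units of the layer.
W = rng.standard_normal((n_units, n_inputs))
bias = np.zeros(n_units)

x = rng.standard_normal((batch, n_inputs))
out = x.dot(W.T) + bias  # shape (batch, n_units)

# Example weight modifier: clip each unit's incoming weight norm to max_norm.
max_norm = 1.0
norms = np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)
W_clipped = W * np.minimum(1.0, max_norm / norms)
```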

Plan for Monitors

Issue A

Proposals:

1. Rename MonitorAccuracy to MonitorClassificationAccuracy. Additionally, later implement MonitorMeanSquaredError etc. We already have a separate MonitorHammingScore.
2. Implement MonitorScore/MonitorError (a better name) which allows the options to monitor various types of scores/errors such as classification error, hamming score, mean squared error etc.

2 will prevent code duplication. 1 might be more intuitive to use?
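
To illustrate proposal 2, here is a minimal sketch in which the scoring rule is a plain function passed to one generic monitor. The class and function names are placeholders and the call signature is simplified (a real monitor would pull outputs and targets from the network).

```python
import numpy as np

def classification_accuracy(outputs, targets):
    """Fraction of rows whose argmax matches the integer target."""
    return float(np.mean(outputs.argmax(axis=1) == targets))

def mean_squared_error(outputs, targets):
    return float(np.mean((outputs - targets) ** 2))

class MonitorScore(object):
    """Hypothetical generic monitor parameterized by a scoring function."""

    def __init__(self, score_fn, name=None):
        self.score_fn = score_fn
        self.name = name or score_fn.__name__

    def __call__(self, outputs, targets):
        return self.score_fn(outputs, targets)

outputs = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = np.array([1, 0, 0])
accuracy = MonitorScore(classification_accuracy)(outputs, labels)
```

Proposal 1 could still be offered on top of this as thin, named wrappers if that turns out to be more intuitive.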

Issue B

Additionally, @Qwlouse and I talked some time ago about renaming Monitors (perhaps to 'Hooks'?), because Monitors can do more than just monitor things.

Support for Context slices

add number of context frames as layer property
support context slice in buffer organization
add method to copy the context
add method to set the context

Test layers overwrite fwd and bwd states

Test Layer adds to in_deltas

All buffers have to be of the same dtype

Currently, all buffers (parameters, internals, gradients, ...) are assumed to have the same dtype (typically either float or double). This is a bit restrictive: For example, in a max-pooling operation, one would like to store which cell in the current window has the maximum value (as discussed in #29). Something similar would happen in a Maxout-Layer, or when implementing a Top-K Autoencoder. I can work around this for the max-pooling OP, but in general it would be nice to be able to specify an optional dtype for each ShapeTemplate.

Inline documentation

How should we format the docstrings? I do like the numpy style, but PyCharm doesn't seem to pick up type hints from them.

Example Numpy (no pycharm support):

def foo(a, b):
    """
    One line summary.

    More detailed description

    Parameters
    ----------
    a : int
        Some number

    b : str
        Some text

    Returns
    -------
    str
        The repeated text
    """
    return "".join([b] * a)

Google style:

def foo(a, b):
    """
    One line summary.

    More detailed description

    Args:
        a (int): Some number
        b (str): Some text

    Returns:
        str: The repeated text
    """
    return "".join([b] * a)

Pycharm works fine with reStructuredText:

def foo(a, b):
    """
    One line summary.

    More detailed description

    :param a: Some number
    :type a: int
    :param b: Some text
    :type b: str

    :returns: the repeated text
    :rtype: str
    """
    return "".join([b] * a)

Epytext also works and is very similar except for the usage of @:

def foo(a, b):
    """
    One line summary.

    More detailed description

    @param a: Some number
    @type a: int
    @param b: Some text
    @type b: str

    @returns: the repeated text
    @rtype: str
    """
    return "".join([b] * a)

Sphinx should be able to support all of them.

Plotting functionality for monitors

Would it make sense to provide basic plotting functionality to monitors that output per-epoch results (i.e. the accuracy monitor)?

In one glance you would be able to obtain information about (speed of) convergence and performance.
Allows easier sharing of results from one researcher to another.
Would require matplotlib to be in place if the user chooses to make use of this functionality.

Allow slice-notation for layer wiring

This is a big feature that would allow you to do

inp = InputLayer(20)
outp = ForwardLayer(4)
inp[:10] >> ForwardLayer() >> outp
inp[10:] >> ForwardLayer() >> outp
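
One way this could work is via `__getitem__` returning a small proxy object that remembers the slice and also supports `>>`. The sketch below uses toy classes that only record the wiring; they are not the actual construction layers.

```python
class Layer(object):
    """Toy layer that only records its incoming connections."""

    def __init__(self, name, size=None):
        self.name = name
        self.size = size
        self.incoming = []  # list of (source_layer, feature_slice)

    def __getitem__(self, feature_slice):
        if not isinstance(feature_slice, slice):
            raise TypeError("only slices are supported")
        return SlicedOutput(self, feature_slice)

    def __rshift__(self, other):
        # Unsliced connection: wire all features.
        other.incoming.append((self, slice(None)))
        return other

class SlicedOutput(object):
    """Proxy for a subset of a layer's output features."""

    def __init__(self, layer, feature_slice):
        self.layer = layer
        self.feature_slice = feature_slice

    def __rshift__(self, other):
        other.incoming.append((self.layer, self.feature_slice))
        return other

inp = Layer("Input", 20)
outp = Layer("Output", 4)
h1, h2 = Layer("Hidden1"), Layer("Hidden2")
inp[:10] >> h1 >> outp
inp[10:] >> h2 >> outp
```

The recorded slices would then have to be resolved during architecture post-processing, once the layer sizes are known.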

Move layer size calculation out of python >> api

It should be part of the architecture post-processing

Describing and serializing objects

From what I can see, the mechanism for generating descriptions and serializing networks etc. does not fully work yet. @Qwlouse, you were working on this. Any comments on what else is needed?

I think we should have network.save() and trainer.save() methods for dumping to disk, and load_network() and load_trainer() functions for reading the dumps. It shouldn't be more complicated than this, I think. Thoughts?

Test context slice allows for continuing a forward pass

Trainer output "Stopping because:" should be controllable via some verbosity setting

MergeLayer

We need a layer that has multiple inputs and just concatenates them along the last feature dimension.
For CPU that one can be omitted, because the NumpyHandler supports slicing the features, but for usage with the PyCudaHandler this is the only way of merging the outputs of two layers.
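
In NumPy terms, the forward and backward passes of such a layer boil down to a concatenate and a split along the last axis. This is only a sketch of the math, not a layer implementation against the handler API.

```python
import numpy as np

def merge_forward(inputs):
    """Concatenate a list of arrays along the last (feature) axis."""
    return np.concatenate(inputs, axis=-1)

def merge_backward(out_deltas, sizes):
    """Split the output deltas back into one block per input."""
    split_points = np.cumsum(sizes)[:-1]
    return np.split(out_deltas, split_points, axis=-1)

a = np.ones((2, 3, 4))        # (time, batch, features)
b = 2 * np.ones((2, 3, 6))
merged = merge_forward([a, b])
da, db = merge_backward(merged, sizes=[4, 6])
```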

Naming conventions for handler operations

We should have naming conventions to make it easier to remember and use operations provided by the handlers.

Suggested naming scheme:
t := tensor (array of any dimensionality)
m := matrix (2D array)
v := vector (1D array)
s := scalar
Prefix element-wise operations with elem_

Convention: <operation>_<operand types>, e.g.
elem_add_tt : adds 2 tensors element-wise
dot_mm : multiplies 2 matrices (matrix product)
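
A toy handler following this scheme might look as follows. It is a NumPy-backed sketch for illustration only; `elem_mult_tt` is an extra example name I made up to show the pattern, and the `out` argument style is an assumption.

```python
import numpy as np

class ToyHandler(object):
    """Illustrates the naming scheme; not brainstorm's NumpyHandler."""

    def elem_add_tt(self, a, b, out):
        """Element-wise sum of two tensors of equal shape."""
        np.add(a, b, out=out)

    def elem_mult_tt(self, a, b, out):
        """Element-wise product of two tensors of equal shape."""
        np.multiply(a, b, out=out)

    def dot_mm(self, a, b, out):
        """Matrix product of two 2D arrays."""
        np.dot(a, b, out=out)

h = ToyHandler()
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.eye(2)
product = np.empty((2, 2))
total = np.empty((2, 2))
h.dot_mm(a, b, product)
h.elem_add_tt(a, b, total)
```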

Streams

Sooner or later, we should think about introducing CUDA streams for our GPU implementation. Side effect: looking at the profiling outputs, across various examples the most expensive call we make is usually the set_from_numpy call in the PyCudaHandler. We should be able to eliminate the cost of that call completely once we use streams, as the memory transfers can all be done asynchronously (and we could finally implement a sensible double-buffering on GPUs).

I can think of two ways to add Streams:

Option 1: Specify a stream for each call
Add a stream=None optional argument to all the handler functions, so that the caller can specify the stream on which to execute. When the stream is not specified, we run on the default stream. We could pass either real cuda-streams, or just stream-IDs (integers). Calls would then maybe look like this:

    _h.dot_add_mm(dIa[t], x[t], dWi, transa=True, stream=_h.stream[1])
    _h.dot_add_mm(dFa[t], x[t], dWf, transa=True, stream=_h.stream[2])
    _h.dot_add_mm(dOa[t], x[t], dWo, transa=True, stream=_h.stream[3])
    _h.dot_add_mm(dZa[t], x[t], dWz, transa=True, stream=_h.stream[4])
    ...
    _h.synchronize_all_streams()

Option 2: Add a separate function for specifying streams:

    _h.set_stream(1)
    _h.dot_add_mm(dIa[t], x[t], dWi, transa=True)
    _h.set_stream(2)
    _h.dot_add_mm(dFa[t], x[t], dWf, transa=True)
    _h.set_stream(3)
    _h.dot_add_mm(dOa[t], x[t], dWo, transa=True)
    _h.set_stream(4)
    _h.dot_add_mm(dZa[t], x[t], dWz, transa=True)
    ...
    _h.synchronize_all_streams()

In this short example, option 1 clearly looks better (IMO), but I can see option 2 working out nicely, too.

Another thing to consider is that we might set up some rules about streams. For example, something like "outputs should always be computed on streams 0-4"... or maybe it even makes sense to have different streams for outputs, internals and parameters, so we know which ones we need to synchronize on before starting computations in a new layer (or not, IDK).

PyCudaHandler is broken

The recent added functionality seems to have broken the PyCudaHandler (including the tests). Running the tests results in:

Traceback (most recent call last):
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/_pytest/config.py", line 543, in importconftest
    mod = conftestpath.pyimport()
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/py/_path/local.py", line 650, in pyimport
    __import__(modname)
  File "/home/arkade/Dropbox/codes/brainstorm/test/conftest.py", line 7, in <module>
    from brainstorm.structure.architecture import (
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/__init__.py", line 5, in <module>
    from brainstorm.structure import *
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/__init__.py", line 4, in <module>
    from brainstorm.structure.network import Network
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/network.py", line 11, in <module>
    from brainstorm.structure.buffers import BufferManager
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/buffers.py", line 6, in <module>
    from brainstorm.handlers import default_handler
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/__init__.py", line 10, in <module>
    from brainstorm.handlers.pycuda_handler import PyCudaHandler
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 40, in <module>
    class PyCudaHandler(Handler):
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 79, in PyCudaHandler
    array_type = pycuda.gpuarray.GPUArray
NameError: name 'pycuda' is not defined

Changing L79 in pycuda_handler.py to array_type = gpuarray.GPUArray leads to the bigger problem:

Traceback (most recent call last):
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/_pytest/config.py", line 543, in importconftest
    mod = conftestpath.pyimport()
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/py/_path/local.py", line 650, in pyimport
    __import__(modname)
  File "/home/arkade/Dropbox/codes/brainstorm/test/conftest.py", line 7, in <module>
    from brainstorm.structure.architecture import (
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/__init__.py", line 5, in <module>
    from brainstorm.structure import *
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/__init__.py", line 4, in <module>
    from brainstorm.structure.network import Network
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/network.py", line 11, in <module>
    from brainstorm.structure.buffers import BufferManager
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/buffers.py", line 6, in <module>
    from brainstorm.handlers import default_handler
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/__init__.py", line 10, in <module>
    from brainstorm.handlers.pycuda_handler import PyCudaHandler
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 627, in <module>
    _mod = SourceModule(__softmax_kernel_code)
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/pycuda/compiler.py", line 262, in __init__
    self.module = module_from_buffer(cubin)
LogicError: cuModuleLoadDataEx failed: invalid device context -

Handler-specific storage requirements in layers

While implementing convolutions/pooling, I've stumbled upon the following problem: Depending on how you implement the operations, you might need more/less storage. Specifically, the GPU and CPU implementations might need different amounts of it / might not need it at all. Currently I have two examples for this:

cudnn needs a ton of descriptors for each convolution/pool-operation. These need to be allocated before calling the cudnn-convolution ops, and de-allocated right after. Which I'm assuming incurs some runtime cost -- haven't measured it though. I once asked an nvidia guy about it, he said the operations are pretty cheap, but I'm still assuming there's a malloc/free involved. These descriptors don't change within one layer, i.e., layerX will always construct/deallocate the same descriptors over and over. Additionally, some cudnn-convolution operations use a "workspace" memory that currently also needs to be allocated/deallocated after each conv-operation.
When implementing max-pooling, one could remember the "argmax" (i.e., which coordinate in the current window is the maximum) during the forward pass, and re-use that information in the backward pass. This would significantly cut down runtime on the backward pass, at the cost of some more memory. However, we can only do this in the CPU version (since we can't change cudnn, which apparently doesn't follow this scheme).

In both cases, one of the two handlers needs additional storage, while the other doesn't. What's even weirder: the argmax can be seen as a buffer, and could be handled by the buffer manager. However, that'd lead to wasting memory on the GPU, where we'd allocate the buffer but never use it (which might also be confusing to users who inspect these buffers expecting them to mean something). The descriptors OTOH are cudnn-specific structures and probably not meant to be stored in buffers.

I can think of two solutions:

Solution 1: Add something like handler.allocate_pool/conv_specific_memory(...) that returns some sort of opaque data structure (maybe a list of descriptors, allocations), which is then stored within each layer and always passed to the conv/pooling methods:

  # in layer ctor:
  self._pooling_data = self.handler.allocate_pool_specific_memory()

  # in forward pass
  def forward_pass(...):
      # each specific handler implementation is free to ignore the last argument if it doesn't need it
      self.handler.conv2d_forward_batch(inputs, window, outputs, pad, stride, self._pooling_data)

Solution 2: Allocate/deallocate cudnn-specific stuff in each call, and make "argmax" an internal buffer of the pooling layer.

I'm not superhappy with either solution, since both are slightly ugly. I like solution 1 slightly more, but it has the additional problem of making the API a bit more complicated. What do you guys think?
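
For the argmax part, this is roughly what the CPU side would do, written in plain NumPy as a sketch. It assumes non-overlapping k x k windows, input sizes divisible by k, and a (batch, height, width, channels) layout; none of this is the actual handler code. Note that the argmax buffer is an integer array, which is the handler-specific storage in question.

```python
import numpy as np

def maxpool_forward(x, k):
    """Max-pool x of shape (batch, height, width, channels) with k x k windows.

    Returns the pooled output and the integer argmax within each window.
    """
    b, h, w, c = x.shape
    windows = x.reshape(b, h // k, k, w // k, k, c)
    # Bring the two window axes to the end and merge them.
    windows = windows.transpose(0, 1, 3, 5, 2, 4)
    windows = windows.reshape(b, h // k, w // k, c, k * k)
    argmax = windows.argmax(axis=-1)
    out = np.take_along_axis(windows, argmax[..., None], axis=-1)[..., 0]
    return out, argmax

def maxpool_backward(out_deltas, argmax, k):
    """Route each output delta to the input position that was the maximum."""
    b, oh, ow, c = out_deltas.shape
    grad = np.zeros((b, oh, ow, c, k * k), dtype=out_deltas.dtype)
    np.put_along_axis(grad, argmax[..., None], out_deltas[..., None], axis=-1)
    # Undo the window rearrangement from the forward pass.
    grad = grad.reshape(b, oh, ow, c, k, k).transpose(0, 1, 4, 2, 5, 3)
    return grad.reshape(b, oh * k, ow * k, c)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 4, 3))
out, argmax = maxpool_forward(x, 2)
in_deltas = maxpool_backward(np.ones_like(out), argmax, 2)
```

With the stored argmax, the backward pass is a single scatter and never has to recompute the maxima.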

Debug handler

should wrap over any other handler and provide lots of checking

Check for invalid values using DebugHandler

Allow deep copies of BufferView

It will be useful to allow deepcopy-ing a BufferView e.g. during tests and introspection. The BufferView class currently does not allow this.

Monitors always write to stdout

Currently, all monitors write to stdout. If brainstorm is used from an IPython notebook, and some monitor has an update interval, this will inevitably lead to a completely frozen browser session, as IPython notebooks usually cannot deal with the massive amount of output produced by many monitors.

It would be nice if the user could set where each monitor writes its output. sys.stdout is a sensible default, but for many applications it makes more sense to log to a file instead. Ideally this setting could be changed on a per-monitor basis, where the destination (or whatever one wants to call it) parameter could either be a file-like object or a string denoting a filename.

(Optionally, we could still print to stdout and a file if verbose==True, and just to the file if verbose==False)
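
A minimal sketch of the idea, with a toy class standing in for the real monitors. The `destination` parameter name and the `log` method are assumptions for illustration.

```python
import io
import sys

class Monitor(object):
    """Toy monitor with a per-instance output destination."""

    def __init__(self, name, destination=None):
        self.name = name
        # Default to stdout; accept a file-like object or a filename.
        self.destination = sys.stdout if destination is None else destination

    def log(self, message):
        line = "%s: %s\n" % (self.name, message)
        if isinstance(self.destination, str):
            # A string is treated as a filename and appended to.
            with open(self.destination, "a") as f:
                f.write(line)
        else:
            self.destination.write(line)

log_buffer = io.StringIO()
Monitor("accuracy", destination=log_buffer).log("0.93")
```

Anything with a `write` method works as a destination, so notebooks could pass an in-memory buffer and scripts a filename.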

Support convolutional layers with batch normalization

Extend layer tests

Layers should be testing for various supported activation functions (instead of just defaults).
Additionally, we need to make sure we are testing all layers equally extensively.

Minibatches Iterator should cut sequences according to mask

Describing converts tuples to lists

This happens for things like kernel sizes, strides etc. If possible, this behavior should be avoided.
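
The same effect is easy to reproduce with a JSON round trip, which has no tuple type. Whether the description code goes through JSON is an assumption here, and the key list in the workaround is purely illustrative.

```python
import json

description = {"kernel_size": (3, 3), "stride": (1, 1), "name": "Conv1"}

# JSON has only arrays, so tuples come back as lists.
restored = json.loads(json.dumps(description))

# One possible workaround: convert known tuple-valued keys back on load.
TUPLE_KEYS = ("kernel_size", "stride")

def restore_tuples(desc, tuple_keys=TUPLE_KEYS):
    return {key: tuple(value) if key in tuple_keys else value
            for key, value in desc.items()}

fixed = restore_tuples(restored)
```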

Difference between constant-size buffers and parameters

For the trainer it is important to have a common buffer for all the parameters and one for all the gradients. Right now I solved the problem by introducing a separate view to the main constant-sized buffer directly in the BufferManager:

    net.buffer.parameters
    net.buffer.gradients

But that is problematic, because not every constant-sized view is also a parameter. Only views inside a parameters category should go into that view. We could easily solve that problem without changing the layout code much by providing a fake layer called parameters. In the BufferManager it would then look like that:

net.buffer.forward.parameters   # all parameters
net.buffer.backward.parameters  # all gradients

The only objection against that could be that it is confusing to have this fake layer.
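
The underlying mechanism, shown in plain NumPy with made-up layer names and shapes: all parameters live in one flat array, and each layer gets a reshaped view into it, so a trainer update on the flat array is visible through every view.

```python
import numpy as np

# (name, shape) pairs in layout order; names and shapes are illustrative.
shapes = [("Hidden.W", (4, 3)), ("Hidden.bias", (4,)), ("Output.W", (2, 4))]

total = sum(int(np.prod(shape)) for _, shape in shapes)
parameters = np.zeros(total)  # one buffer for the trainer

views = {}
start = 0
for name, shape in shapes:
    size = int(np.prod(shape))
    # Slicing plus reshape of a contiguous 1D array yields a view, not a copy.
    views[name] = parameters[start:start + size].reshape(shape)
    start += size

# A trainer step on the flat buffer updates every layer's view in place.
parameters -= 0.1
```

A gradients buffer would be built the same way, which is what lets the trainer treat all parameters as a single vector.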

Merge forward and backward buffers

I wanted to continue the discussion about merging the forward- and backward buffers into a single buffer. The main advantage I see in this is that the code will be easier to read: whenever I show people brainstorm code, one of the first questions that pops up is "what do these buffers mean/why are there 2 of them" -- admittedly my sample size so far is 2 (I myself was also confused, so let's make it 3 ;) ). Also a pattern that emerges often within the code is something along the lines of:

    Ha = forward_buffers.internals.Ha
    dHa = backward_buffers.internals.Ha

Which indicates, to me at least, that the true name of backward_buffers.internals.Ha should actually be buffers.internals.dHa anyway.

If we implement this change, there are two ways to go:

Explicitly list the 'backward buffers' in the get_internal_structure:

def get_internal_structure(self):
    internals = OrderedDict()
    internals['Ha'] = ShapeTemplate('T', 'B', self.size)
    internals['dHa'] = ShapeTemplate('T', 'B', self.size)
    return internals

Have a flag for each buffer requested via get_internal_structure, which indicates whether we need a "backward" buffer as well. In that case, just append a "d" to the name of the "forward buffer" we created. e.g.:
```
def get_internal_structure(self):
    internals = OrderedDict()
    internals['Ha'] = ShapeTemplate('T', 'B', self.size, needs_gradient_buffer=True)
    return internals
```

The advantage of the 2nd approach is that it would still be fairly easy to lazily allocate backward buffers, and that it's less wordy. The downside is that it might be a bit "magical". Which is why I actually prefer the first approach, even though it wastes memory and requires more typing. Typically, if you have enough memory to run a forward pass, you'll have enough for the backward pass as well, IMO.

Anyways... thoughts?

Implementing and testing custom layers

IMHO a strong point of brainstorm is that you can very easily implement your own layer, and distribute it to others without having to fork the library. For that process it'd be very helpful to have a convenient way of running your custom layer against all the layer-tests.
I think that should not be hard to implement and would provide a lot of value.
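
The core of such a test would be a finite-difference gradient check. The sketch below uses a toy layer with a simplified forward/backward interface (an assumption, not the real layer API) to show the kind of check a custom layer could be run against.

```python
import numpy as np

class TanhLayer(object):
    """Toy custom layer with a simplified forward/backward interface."""

    def forward(self, x):
        self.y = np.tanh(x)
        return self.y

    def backward(self, out_deltas):
        return out_deltas * (1.0 - self.y ** 2)

def check_input_gradient(layer, x, eps=1e-6):
    """Return the max abs difference between analytic and numeric gradients."""
    out_deltas = np.random.RandomState(0).randn(*layer.forward(x).shape)
    analytic = layer.backward(out_deltas)

    numeric = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + eps
        plus = np.sum(layer.forward(x) * out_deltas)
        x[idx] = old - eps
        minus = np.sum(layer.forward(x) * out_deltas)
        x[idx] = old  # restore the original value
        numeric[idx] = (plus - minus) / (2 * eps)
    return np.max(np.abs(analytic - numeric))

x = np.random.RandomState(1).randn(2, 3)
error = check_input_gradient(TanhLayer(), x)
```

A packaged version could take the layer class and its constructor arguments and run this alongside the other layer tests.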

double buffering

For double buffering to work properly, the new strategy of copying data to device in provide_external_data instead of the iterators requires some changes to how InputLayer works. This is a TODO, but it looks like there is another problem:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/training/trainer.py", line 157, in run_it
    net.provide_external_data(next(it))
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/network.py", line 290, in provide_external_data
    self.handler.set_from_numpy(buf, data[name])
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 73, in set_from_numpy
    mem.set(arr.astype(self.dtype))
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/pycuda/gpuarray.py", line 243, in set
    _memcpy_discontig(self, ary, async=async, stream=stream)
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/pycuda/gpuarray.py", line 1190, in _memcpy_discontig
    drv.memcpy_htod(dst.gpudata, src)
LogicError: cuMemcpyHtoD failed: invalid device context

Add forward and backward state

similar to param buffer

change naming conventions for layers

Right now the construction layers are named the same as the actual layer implementations.
This leads to confusion and should be changed.

Implement PyCuda Handler operations needed for the forward layer

Rmsprop, Adadelta, Adam steppers

Convert tests to use specific DebugHandler

Recurrent layer weight semantics

Should make sure that each row in the recurrent weight matrix is used as input weights to a single neuron.

Change the signature of `copy_to`

Currently the signature is copy_to(dest, src), following the numpy.copyto signature. But the ordering is confusing. I vote for changing it to copy_to(src, dest) to make it more intuitive, and because I don't think it is that important to be consistent with this numpy function (which is rarely used anyways).
If you think that would still be confusing we could rename the method to more clearly deviate from the numpy one.
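
For comparison, here are the two argument orders side by side. The `copy_to` wrapper is a sketch of the proposed signature on top of NumPy, not the handler method itself.

```python
import numpy as np

def copy_to(src, dest):
    """Proposed argument order: source first, destination second."""
    np.copyto(dest, src)

src = np.arange(4.0)

dest_numpy = np.zeros(4)
np.copyto(dest_numpy, src)   # NumPy's order: destination first

dest_proposed = np.zeros(4)
copy_to(src, dest_proposed)  # proposed order: source first
```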