Neural Monkey

Neural Sequence Learning Using TensorFlow

The Neural Monkey package provides a higher-level abstraction for sequential neural network models, most prominently in natural language processing (NLP). It is built on TensorFlow and can be used for fast prototyping of sequential models in NLP, e.g. for neural machine translation or sentence classification.

The higher-level API brings together a collection of standard building blocks (RNN encoder and decoder, multi-layer perceptron) and a simple way of adding new building blocks implemented directly in TensorFlow.

Usage

neuralmonkey-train <EXPERIMENT_INI>
neuralmonkey-run <EXPERIMENT_INI> <DATASETS_INI>
neuralmonkey-server <EXPERIMENT_INI> [OPTION] ...
neuralmonkey-logbook --logdir <EXPERIMENTS_DIR> [OPTION] ...
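
For example, training a model and then translating a test set might look like this. The configuration file names below are placeholders for your own files, not files shipped with the package:

    # Hypothetical file names; point these at your own experiment and dataset configs.
    neuralmonkey-train my-experiment.ini
    neuralmonkey-run my-experiment.ini my-test-data.ini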

Installation

  • You need Python 3.6 (or higher) to run Neural Monkey.

  • When using a virtual environment, execute these commands to install the Python dependencies:

    $ source path/to/virtualenv/bin/activate
    
    # For GPU-enabled version
    (virtualenv)$ pip install --upgrade -r requirements-gpu.txt
    
    # For CPU-only version
    (virtualenv)$ pip install --upgrade -r requirements.txt
  • If you are using the GPU version, make sure that the LD_LIBRARY_PATH environment variable points to the lib and lib64 directories of your CUDA and cuDNN installations. Similarly, your PATH variable should point to the bin subdirectory of the CUDA installation directory (a sketch of the relevant exports is shown after this list).

  • If the training crashes on an unknown dependency, just install it with pip. Remember to keep your virtual environment up to date with the package requirements file, which may change over time. To update the dependencies, re-run the pip install command from above (pay attention to the distinction between the GPU and non-GPU versions).
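
For the GPU setup mentioned above, a minimal sketch of the environment variables might look like this; the CUDA installation path is an assumption and will differ on your machine:

    # Adjust the paths to your actual CUDA and cuDNN installations (assumed here to be /usr/local/cuda).
    export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/lib:$LD_LIBRARY_PATH"
    export PATH="/usr/local/cuda/bin:$PATH"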

Getting Started

There is a tutorial that you can follow, which gives you an overview of how to design your experiments with Neural Monkey.

Package Overview

  • bin: Directory with neuralmonkey executables

  • examples: Example configuration files for ready-made experiments

  • lib: Third party software

  • neuralmonkey: Python package files

  • scripts: Directory with tools that may come in handy. Note that dependencies for these tools may not be listed in the project requirements.

  • tests: Test files

Documentation

You can find the API documentation of this package here. The documentation files are generated from docstrings using autodoc and Napoleon extensions to the Python documentation package Sphinx. The docstrings should follow the recommendations in the Google Python Style Guide. Additional details on the docstring formatting can be found in the Napoleon documentation as well.

Related projects

  • tflearn – a more general and less abstract deep learning toolkit built over TensorFlow

  • nlpnet – deep learning tools for tagging and parsing

  • NNBlocks – a library built on Theano containing NLP-specific models

  • Nematus – a tool for training and running neural machine translation models

  • seq2seq – a general-purpose encoder-decoder framework for TensorFlow

  • OpenNMT – open-source NMT in Torch

Citation

If you use the tool for academic purposes, please consider citing the following paper:

@article{NeuralMonkey:2017,
    author = {Jind{\v{r}}ich Helcl and Jind{\v{r}}ich Libovick{\'{y}}},
    title = {{Neural Monkey: An Open-source Tool for Sequence Learning}},
    journal = {The Prague Bulletin of Mathematical Linguistics},
    year = {2017},
    address = {Prague, Czech Republic},
    number = {107},
    pages = {5--17},
    issn = {0032-6585},
    doi = {10.1515/pralin-2017-0001},
    url = {http://ufal.mff.cuni.cz/pbml/107/art-helcl-libovicky.pdf}
}

License

The software is distributed under the BSD License.

neuralmonkey's People

Contributors

cifkao, hajicj, hyperparticle, jindrahelcl, jlibovicky, juliakreutzer, kocmitom, kvapili, martinpopel, obo, ozancaglayan, rihardsk, simon-will, tomasmcz, tuetschek, vaclavpavlicek, varisd

neuralmonkey's Issues

Ensembles

Write support for ensemble models.

The idea is, in the end, to give the running script multiple *.ini files (or one file with links to other experiments).

Prevent collisions of encoder/decoder variables

If you attempt to create multiple encoders and do not provide their names, it will crash later on a collision of the variable scopes. I would suggest a mechanism (probably in utils.py) that would always be asked for a name and would append a number if there is a collision.

Discussion on configuration

@jlibovicky, in #15 you mentioned that it would be hard to generate an ini file. Why do we need to do that?

I'm not quite sure what the design (if any) of the configuration manipulation is. I thought that there is an ini file that gets parsed into an abstract representation (some terrible Python object), then we build a computation graph according to the representation and run it. Is there anything else happening?

Global random_seed

Can we define one random_seed at the top level of the configuration that will be used everywhere?

What is the goal of this package?

I'm not quite sure what we are trying to achieve here. What is the goal of this package? How does it differ from tflearn and similar frameworks? Are we writing something that has already been done elsewhere? If not, what is new here?

These questions should be clearly answered in the README, if we want anybody to use this.

unstated dependencies

neuralmonkey/estimate_scheduled_sampling.py depends on scipy. Scipy is quite a big dependency; I'd hate to install it just for one function. Can we do something about this?

Cannot import 'Levenshtein'

Cannot import 'Levenshtein' when trying to run pylint on evaluation.py. This means that a package is missing in requirements.txt.

Lazy dataset should have a config method

The lazy dataset should have a building function in the config module similar to the one for the standard (in-memory) dataset. Moreover, its __init__ method should be refactored the same way.

Do not catch general exceptions in config_loader

When you catch general exceptions there, every exception (e.g. an import error) from the module that you want something from gets caught, and the error message about the non-existence of something that is clearly there is quite confusing. By the way, you should never just catch general exceptions. Fix this after I finish #4.

Add option to hide raw reference and output

Since 5a4498, the original (meaning not post-processed) decoded output and the pre-processed reference are shown in the validation log. This makes the validation output twice as large and ultimately more hideous, but it's useful when debugging pre- and postprocessing.

Multiple decoders

We should be able to create models that have multiple decoders at the same time, e.g. a model that classifies a sentence and outputs a sequence at the same time.

Lazy dataset will use wildcards

Data are often split into multiple files. The lazy dataset, which is designed for loading bigger datasets, should be able to accept a list of files or wildcards specifying the files it will read.

Things that we need to refactor

While I try to make #6 happen, I find many issues that I am not capable of or willing to address. I will maintain a checklist of what needs to be done here. By the way, you would not believe how much code that clearly is not working (random.random > 0.5, unused variables, ...) I've encountered so far.

  • processors/bpe.py classes are questionable here.
  • processors/german.py classes are questionable
  • utils.py This needs to be a class!
  • logging.py various
  • decoding_function.py There are lots of arguments, some unused. This needs a complete refactor.
  • mlp.py This really should not be a class. Also, aren't there any existing implementations of a multi-layer perceptron that we can use?
  • readers/plain_text_reader.py This should not be a class.
  • config/config_loader.py general exceptions, see #12
  • config/configuration.py general exceptions, see #12
  • config/config_generator.py I think this should be abandoned, see #17
  • encoders/sentence_encoder.py crazy big object, 13 parameters...
  • encoders/image_encoder.py ditto
  • encoders/cnn_encoder.py ditto
  • bidirectional_rnn_layer.py class is questionable
  • tokenize_data.py oh the horror
  • decoders/sequence_classifier.py too many instance attributes
  • decoders/decoder.py big and ugly object
  • image_utils.py questionable class
  • prepare_str_images.py general exception
  • precompute_image_features.py too long, break it up
  • trainers/copynet.py undefined variable!
  • trainers/cross_entropy_trainer.py questionable class
  • trainers/mixer.py various errors
  • decompound_truecase.py javabridge
  • runners/runner.py questionable class
  • runners/beamsearch.py questionable class
  • runners/copynet_runner.py undefined variable!
  • runners/perplexity.py questionable class
  • logbook/logbook.py dependencies
  • cells/noisy_gru_cells.py various
  • learning_utils.py pure evil – half-screen levels of indentation
  • caffe_image_features.py imports and other things
  • lazy_dataset.py argument numbers, non-existent members
  • reformat_downloaded_image_features.py imports

tests culture

Test scripts should be moved away from the root directory. Also, why are there two files, tests_run.sh and run_tests.sh, and what is their purpose? This should all be done in the tests directory, along with unit-tests_run.sh, mypy_run.sh, lint_run.sh and others.

Also, one of the run-tests scripts (tests_run, I think) should use the -P (or --directory-prefix) option of wget instead of cd-ing there and back again. The test-output directory should be generated somewhere other than the root of the repository, preferably in a temporary location: either in /tmp or in a tmp subdirectory of tests.
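
As a sketch, the suggested wget invocation could look like this; the URL and target directory are made up for illustration:

    # Download test data directly into the target directory instead of cd-ing there and back.
    wget -P tests/data http://example.com/test-data.tar.gz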

Documentation

  • Create documentation from docstrings
  • Publish it automatically

The lib directory

Should subword_nmt be a submodule? Are we doing the imports right?

Should we freeze dependencies?

If we do not pin concrete versions of dependencies in the requirements, things like the error in #54 might happen from time to time. On the other hand, if we freeze the dependencies, we should check for updates from time to time, which means more work. I'm leaning towards automatic updates and letting the build fail from time to time. We test things fairly regularly now, so we should be able to catch and repair breaking changes. What do you think?

Run as a web service

Once we have the run.py script, it should be extremely easy to do using Flask (which is already a dependency). It will receive a dictionary of dataset series (the same way we have it right now) as JSON and send back a JSON response with outputs and some statistics.
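
A rough sketch of what such a request might look like, assuming a hypothetical /run endpoint and a dataset series named source (neither exists yet):

    # Hypothetical request to the planned web service; endpoint, port and field names are assumptions.
    curl -X POST http://localhost:5000/run \
         -H 'Content-Type: application/json' \
         -d '{"source": ["a sentence to translate .", "another one ."]}'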

Do not list encoders in [main] configuration

Now, the encoders are listed multiple times in the configuration: in the main configuration and as the arguments of a decoder. Duplication is a frequent source of errors, so they should be listed only in the decoder.

Evaluation functions

Evaluation functions should be refactored into callable and comparable objects to simplify the training loop function. They can also define their own names, so the output in the log need not be the name of the function.

Fix random seeds

Right now we can specify a random seed in configuration, but it does not work.

code review

There are many code review tools integrated with GitHub (e.g. Reviewable). Should we use one of them in our workflow?

Logbook shows no experiments on tests

When I run bin/neuralmonkey-logbook --port 5050 --logdir tests/tmp-test-output, I get a screen with "click experiment on the left", but there is no experiment on the left. Why is that?

Wrong symlink in best variables

Commit add9bdc (fix saving variables) introduced a bug: it creates a symlink with a wrong relative address. For example, when I set my output directory to test-out, it creates a link to test-out/data.whatever in that folder, so the script looks for test-out/test-out/data.whatever instead of test-out/data.whatever.

My preferred solution would be to put the commit into a separate branch (rewinding current master by one commit) and merge #13. That will enable running tests/small.ini on Travis. The new branch can be merged when the bug is fixed. What do you think, @jindrahelcl?

Create a package hierarchy

We need to:

  • create a reasonable directory structure
  • create an __init__.py file in every directory
  • rewrite relative imports (e.g. import neuralmonkey.vocabulary instead of import vocabulary)
  • enable relative import warnings in .pylintrc

I've already done this for the tests/python directory; I hope it does not break anything.

License

Since we are hosting this publicly, it should have a license. I personally like MIT or BSD3.

logbook

Why is this logbook thing in master? Is it done, or is it work in progress? If it's done, flask should be a dependency. It is not used anywhere; is it a stand-alone tool? Then it should be documented somewhere.

Create a functional demo on quest

With webservice ready, it is time to set up something like this:
http://quest.ms.mff.cuni.cz/moses/demo.php

Here's the checklist

  • Ask for a VM on quest
  • Install monkey on the VM
  • Run the service on the VM
  • Create a simple webpage with input form
  • Make it work
  • Write documentation

Also, we should create a new label for issues related to the web service.

Package executable

The package should somehow provide an executable. Right now, training may be executed with python -u -m neuralmonkey.train whatever.ini; maybe we should document this somewhere until we find a better way to do it. One of us will have to learn how to manage a proper Python package.

Postprocessing bug

When I ran my tests/small.ini configuration, it failed with an error about a lambda wanting two arguments when just one was given. I solved this in my branch by removing the second (unused) argument of the lambda on line 36 in learning_utils, but I'm not sure whether this breaks anything else. Can you have a look at this, @jindrahelcl?

Edit: The correct lambda was on line 155, but maybe they should be the same?

Run pylint on everything

We should run pylint on everything. For easy automatic checking, every file should have a 10/10 score. To achieve this, you may have to locally disable some warnings (# pylint: disable=...) – use this only if it is really necessary. After you eliminate all errors and warnings, add this line to the file:

# tests: lint

All files containing this line are checked with pylint by lint_run.sh, which you should always run before you commit anything and which is automatically run on Travis CI after you push to GitHub.

You can see the list of files that have not been checked yet with test_status.sh. This issue will be closed when that list is empty.
