Neural Monkey

Neural Sequence Learning Using TensorFlow

The Neural Monkey package provides a higher-level abstraction for sequential neural network models, most prominently in natural language processing (NLP). It is built on TensorFlow and can be used for fast prototyping of sequential models in NLP, e.g. for neural machine translation or sentence classification.

The higher-level API brings together a collection of standard building blocks (RNN encoder and decoder, multi-layer perceptron) and a simple way of adding new building blocks implemented directly in TensorFlow.

Usage

neuralmonkey-train <EXPERIMENT_INI>
neuralmonkey-run <EXPERIMENT_INI> <DATASETS_INI>
neuralmonkey-server <EXPERIMENT_INI> [OPTION] ...
neuralmonkey-logbook --logdir <EXPERIMENTS_DIR> [OPTION] ...
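
For example, training a model and then translating a test set might look like this. The configuration file names below are placeholders for your own files, not files shipped with the package:

    # Hypothetical file names; point these at your own experiment and dataset configs.
    neuralmonkey-train my-experiment.ini
    neuralmonkey-run my-experiment.ini my-test-data.ini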

Installation

  • You need Python 3.6 (or higher) to run Neural Monkey.

  • When using a virtual environment, execute these commands to install the Python dependencies:

    $ source path/to/virtualenv/bin/activate
    
    # For GPU-enabled version
    (virtualenv)$ pip install --upgrade -r requirements-gpu.txt
    
    # For CPU-only version
    (virtualenv)$ pip install --upgrade -r requirements.txt
  • If you are using the GPU version, make sure that the LD_LIBRARY_PATH environment variable points to the lib and lib64 directories of your CUDA and cuDNN installations. Similarly, your PATH variable should point to the bin subdirectory of the CUDA installation directory (a sketch of the relevant exports is shown after this list).

  • If the training crashes on an unknown dependency, just install it with pip. Remember to keep your virtual environment up to date with the package requirements file, which may change over time. To update the dependencies, re-run the pip install command from above (pay attention to the distinction between the GPU and non-GPU versions).
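
For the GPU setup mentioned above, a minimal sketch of the environment variables might look like this; the CUDA installation path is an assumption and will differ on your machine:

    # Adjust the paths to your actual CUDA and cuDNN installations (assumed here to be /usr/local/cuda).
    export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/lib:$LD_LIBRARY_PATH"
    export PATH="/usr/local/cuda/bin:$PATH"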

Getting Started

There is a tutorial that you can follow, which gives you an overview of how to design your experiments with Neural Monkey.

Package Overview

  • bin: Directory with neuralmonkey executables

  • examples: Example configuration files for ready-made experiments

  • lib: Third party software

  • neuralmonkey: Python package files

  • scripts: Directory with tools that may come in handy. Note that dependencies for these tools may not be listed in the project requirements.

  • tests: Test files

Documentation

You can find the API documentation of this package here. The documentation files are generated from docstrings using autodoc and Napoleon extensions to the Python documentation package Sphinx. The docstrings should follow the recommendations in the Google Python Style Guide. Additional details on the docstring formatting can be found in the Napoleon documentation as well.

Related projects

  • tflearn – a more general and less abstract deep learning toolkit built over TensorFlow

  • nlpnet – deep learning tools for tagging and parsing

  • NNBlocks – a library built on Theano containing NLP-specific models

  • Nematus – a tool for training and running neural machine translation models

  • seq2seq – a general-purpose encoder-decoder framework for TensorFlow

  • OpenNMT – open-source NMT in Torch

Citation

If you use the tool for academic purposes, please consider citing the following paper:

@article{NeuralMonkey:2017,
    author = {Jind{\v{r}}ich Helcl and Jind{\v{r}}ich Libovick{\'{y}}},
    title = {{Neural Monkey: An Open-source Tool for Sequence Learning}},
    journal = {The Prague Bulletin of Mathematical Linguistics},
    year = {2017},
    address = {Prague, Czech Republic},
    number = {107},
    pages = {5--17},
    issn = {0032-6585},
    doi = {10.1515/pralin-2017-0001},
    url = {http://ufal.mff.cuni.cz/pbml/107/art-helcl-libovicky.pdf}
}

License

The software is distributed under the BSD License.

neuralmonkey's People

Contributors

cifkao, hajicj, hyperparticle, jindrahelcl, jlibovicky, juliakreutzer, kocmitom, kvapili, martinpopel, obo, ozancaglayan, rihardsk, simon-will, tomasmcz, tuetschek, vaclavpavlicek, varisd

neuralmonkey's Issues

Ensembles

Write support for ensemble models.

The idea is, in the end, to give the running script multiple *.ini files (or one file with links to other experiments).

Prevent collisions of encoder/decoder variables

If you attempt to create multiple encoders and do not provide their names, it will crash later on a collision of the variable scopes. I would suggest a mechanism (probably in utils.py) that would always be asked for a name and would append a number if there is a collision.

Discussion on configuration

@jlibovicky, in #15 you mentioned that it would be hard to generate an ini file. Why do we need to do that?

I'm not quite sure what the design (if any) of the configuration manipulation is. I thought that there is an ini file that gets parsed into an abstract representation (some terrible Python object), then we build a computation graph according to the representation and run it. Is there anything else happening?

Global random_seed

Can we define one random_seed at the top level of the configuration that will be used everywhere?

What is the goal of this package?

I'm not quite sure what we are trying to achieve here. What is the goal of this package? How does it differ from tflearn and similar frameworks? Are we writing something that has already been done elsewhere? If not, what is new here?

These questions should be clearly answered in the README, if we want anybody to use this.

unstated dependencies

neuralmonkey/estimate_scheduled_sampling.py depends on scipy. Scipy is quite a big dependency; I'd hate to install it just for one function. Can we do something about this?

Cannot import 'Levenshtein'

Cannot import 'Levenshtein' when trying to run pylint on evaluation.py. This means that a package is missing in requirements.txt.

Lazy dataset should have a config method

The lazy dataset should have a building function in the config module similar to the one for the standard (in-memory) dataset. Moreover, its __init__ method should be refactored the same way.

Do not catch general exceptions in config_loader

When you catch general exceptions there, every exception (e.g. an import error) from the module that you want something from gets caught, and the error message about the non-existence of something that is clearly there is quite confusing. By the way, you should never just catch general exceptions. Fix this after I finish #4.

Add option to hide raw reference and output

Since 5a4498, the original (meaning not post-processed) decoded output and the pre-processed reference are shown in the validation log. This makes the validation output twice as large and ultimately more hideous, but it's useful when debugging pre- and postprocessing.

Multiple decoders

We should be able to create models that have multiple decoders at the same time, e.g. a model that classifies a sentence and outputs a sequence at the same time.

Lazy dataset will use wildcards

Data are often split into multiple files. The lazy dataset, which is designed for loading bigger datasets, should be able to accept a list of files or wildcards specifying the files it will read.

Things that we need to refactor

While I try to make #6 happen, I find many issues that I am not capable of or willing to address. I will maintain a checklist of what needs to be done here. By the way, you would not believe how much code that clearly is not working (random.random > 0.5, unused variables, ...) I've encountered so far.

  • processors/bpe.py classes are questionable here.
  • processors/german.py classes are questionable
  • utils.py This needs to be a class!
  • logging.py various
  • decoding_function.py There are lots of arguments, some unused. This needs a complete refactor.
  • mlp.py This really should not be a class. Also, aren't there any existing implementations of a multi-layer perceptron that we can use?
  • readers/plain_text_reader.py This should not be a class.
  • config/config_loader.py general exceptions, see #12
  • config/configuration.py general exceptions, see #12
  • config/config_generator.py I think this should be abandoned, see #17
  • encoders/sentence_encoder.py crazy big object, 13 parameters...
  • encoders/image_encoder.py ditto
  • encoders/cnn_encoder.py ditto
  • bidirectional_rnn_layer.py class is questionable
  • tokenize_data.py oh the horror
  • decoders/sequence_classifier.py too many instance attributes
  • decoders/decoder.py big and ugly object
  • image_utils.py questionable class
  • prepare_str_images.py general exception
  • precompute_image_features.py too long, break it up
  • trainers/copynet.py undefined variable!
  • trainers/cross_entropy_trainer.py questionable class
  • trainers/mixer.py various errors
  • decompound_truecase.py javabridge
  • runners/runner.py questionable class
  • runners/beamsearch.py questionable class
  • runners/copynet_runner.py undefined variable!
  • runners/perplexity.py questionable class
  • logbook/logbook.py dependencies
  • cells/noisy_gru_cells.py various
  • learning_utils.py pure evil – half-screen levels of indentation
  • caffe_image_features.py imports and other things
  • lazy_dataset.py argument numbers, non-existent members
  • reformat_downloaded_image_features.py imports

tests culture

Test scripts should be moved away from the root directory. Also, why are there two files, tests_run.sh and run_tests.sh, and what is their purpose? This should all be done in the tests directory, along with unit-tests_run.sh, mypy_run.sh, lint_run.sh and others.

Also, one of the run-tests scripts (tests_run, I think) should use the -P (or --directory-prefix) option of wget instead of cd-ing there and back again. The test-output directory should be generated somewhere other than the root of the repository, preferably in a temporary location: either in /tmp or in a tmp subdirectory of tests.
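
As a sketch, the suggested wget invocation could look like this; the URL and target directory are made up for illustration:

    # Download test data directly into the target directory instead of cd-ing there and back.
    wget -P tests/data http://example.com/test-data.tar.gz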

Documentation

  • Create documentation from docstrings
  • Publish it automatically

The lib directory

Should subword_nmt be a submodule? Are we doing the imports right?

Should we freeze dependencies?

If we do not pin concrete versions of dependencies in the requirements, things like the error in #54 might happen from time to time. On the other hand, if we freeze the dependencies, we should check for updates from time to time, which means more work. I'm leaning towards automatic updates and letting the build fail from time to time. We test things fairly regularly now, so we should be able to catch and repair breaking changes. What do you think?

Run as a web service

Once we have the run.py script, it should be extremely easy to do using Flask (which is already a dependency). It will receive a dictionary of dataset series (the same way we have it right now) as JSON and send back a JSON response with outputs and some statistics.
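
A rough sketch of what such a request might look like, assuming a hypothetical /run endpoint and a dataset series named source (neither exists yet):

    # Hypothetical request to the planned web service; endpoint, port and field names are assumptions.
    curl -X POST http://localhost:5000/run \
         -H 'Content-Type: application/json' \
         -d '{"source": ["a sentence to translate .", "another one ."]}'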

Do not list encoders in [main] configuration

Now, the encoders are listed multiple times in the configuration: in the main configuration and as the arguments of a decoder. Duplication is a frequent source of errors, so they should be listed only in the decoder.

Evaluation functions

Evaluation functions should be refactored into callable and comparable objects to simplify the training loop function. They can also define their own names, so the output in the log need not be the name of the function.

Fix random seeds

Right now we can specify a random seed in configuration, but it does not work.

code review

There are many code review tools integrated with GitHub (e.g. Reviewable). Should we use one of them in our workflow?

Logbook shows no experiments on tests

When I run bin/neuralmonkey-logbook --port 5050 --logdir tests/tmp-test-output, I get a screen with "click experiment on the left", but there is no experiment on the left. Why is that?

Wrong symlink in best variables

Commit add9bdc (fix saving variables) introduced a bug: it creates a symlink with a wrong relative address. For example, when I set my output directory to test-out, it creates a link to test-out/data.whatever in that folder, so the script looks for test-out/test-out/data.whatever instead of test-out/data.whatever.

My preferred solution would be to put the commit into a separate branch (rewinding current master by one commit) and merge #13. That will enable running tests/small.ini on Travis. The new branch can be merged when the bug is fixed. What do you think, @jindrahelcl?

Create a package hierarchy

We need to:

  • create a reasonable directory structure
  • create an __init__.py file in every directory
  • rewrite relative imports (e.g. import neuralmonkey.vocabulary instead of import vocabulary)
  • enable relative import warnings in .pylintrc

I've already done this for the tests/python directory; I hope it does not break anything.

License

Since we are hosting this publicly, it should have a license. I personally like MIT or BSD3.

logbook

Why is this logbook thing in master? Is it done, or is it work in progress? If it's done, flask should be a dependency. It is not used anywhere; is it a stand-alone tool? Then it should be documented somewhere.

Create a functional demo on quest

With webservice ready, it is time to set up something like this:
http://quest.ms.mff.cuni.cz/moses/demo.php

Here's the checklist

  • Ask for a VM on quest
  • Install monkey on the VM
  • Run the service on the VM
  • Create a simple webpage with input form
  • Make it work
  • Write documentation

Also, we should create a new label for issues related to the web service.

Package executable

The package should somehow provide an executable. Right now, training may be executed with python -u -m neuralmonkey.train whatever.ini; maybe we should document this somewhere until we find a better way to do it. One of us will have to learn how to manage a proper Python package.

Postprocessing bug

When I ran my tests/small.ini configuration, it failed with an error about a lambda wanting two arguments when just one was given. I solved this in my branch by removing the second (unused) argument of the lambda on line 36 in learning_utils, but I'm not sure whether this breaks anything else. Can you have a look at this, @jindrahelcl?

Edit: The correct lambda was on line 155, but maybe they should be the same?

Run pylint on everything

We should run pylint on everything. For easy automatic checking, every file should have a 10/10 score. To achieve this, you may have to locally disable some warnings (# pylint: disable=...) – use this only if it is really necessary. After you eliminate all errors and warnings, add this line to the file:

# tests: lint

All files containing this line are checked with pylint by lint_run.sh, which you should always run before you commit anything and which is automatically run on Travis CI after you push to GitHub.

You can see the list of files that have not been checked yet with test_status.sh. This issue will be closed when that list is empty.
