fotranmt's People

Contributors

adamlerer, apaszke, bmccann, bpopeters, colesbury, da03, flauted, francoishernandez, guillaumekln, gwenniger, helson73, jianyuzhan, jrvc, jsenellart, justinchiu, mattiadg, meocong, miau1, pltrdy, scarletpan, sebastiangehrmann, soumith, srush, tayciryahmed, thammegowda, vince62s, waino, wjbianjason, xutaima, zenglinxiao

fotranmt's Issues

CUDA runtime error

Hello, I am getting this error in the middle of training. I have tried training several times, but the same error appears at different steps. Tracing it back, the problem starts at the validation steps.
@raganato

/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/THCTensorIndex.cu:360: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [5,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/THCTensorIndex.cu:360: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [5,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
.
.
.


Traceback (most recent call last):
  File "train.py", line 40, in <module>
    main(opt)
  File "train.py", line 27, in main
    single_main(opt)
  File "/home/aki/OpenNMT-py/onmt/train_single.py", line 262, in main
    opt.valid_steps)
  File "/home/aki/OpenNMT-py/onmt/trainer.py", line 223, in train
    report_stats)
  File "/home/aki/OpenNMT-py/onmt/trainer.py", line 384, in _gradient_accumulation
    dec_state)
  File "/home/aki/anaconda3/envs/zeroshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/aki/OpenNMT-py/onmt/models/model.py", line 75, in forward
    memory_lengths=lengths)
  File "/home/aki/anaconda3/envs/zeroshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/aki/OpenNMT-py/onmt/decoders/decoder.py", line 139, in forward
    tgt, memory_bank, state, memory_lengths=memory_lengths)
  File "/home/aki/OpenNMT-py/onmt/decoders/decoder.py", line 350, in _run_forward_pass
    memory_lengths=memory_lengths)
  File "/home/aki/anaconda3/envs/zeroshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/aki/OpenNMT-py/onmt/modules/global_attention.py", line 181, in forward
    mask = sequence_mask(memory_lengths, max_len=align.size(-1))
  File "/home/aki/OpenNMT-py/onmt/utils/misc.py", line 23, in sequence_mask
    .type_as(lengths)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generic/THCTensorCopy.cpp:21
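
A device-side assert of the form srcIndex < srcSelectDimSize usually means that an index handed to an embedding lookup or index_select is outside the size of the dimension being indexed, for example a token id that is >= the embedding's vocabulary size. Because CUDA kernels run asynchronously, the Python stack trace can point at a later, unrelated line; running with CUDA_LAUNCH_BLOCKING=1, or reproducing the lookup on CPU, gives a synchronous and readable error. A minimal sketch of the CPU check (the vocabulary size and tensors below are made up for illustration, not taken from the failing run):

import torch
import torch.nn as nn

vocab_size = 100                     # hypothetical vocabulary size
emb = nn.Embedding(vocab_size, 8)    # CPU module, so bad indices fail synchronously

good = torch.tensor([[1, 5, 99]])    # all ids < vocab_size -> fine
bad = torch.tensor([[1, 5, 100]])    # id 100 >= vocab_size -> IndexError on CPU,
                                     # but only a deferred device-side assert on CUDA
print(emb(good).shape)
try:
    emb(bad)
except IndexError as e:
    print("out-of-range index:", e)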

Neural interlingua: Torch not enough memory after 8 hours

Hello,

I'm experimenting with my PR branch of neural-interlingua. After 8 hours of problem-free training, this happened:

[2018-11-07 21:47:35,402 INFO] Current language pair: ('de', 'cs')
[2018-11-07 21:47:35,433 INFO] Loading valid dataset from all_pairs_preprocessed/de-cs/data.valid.1.pt, number of examples: 1014
[2018-11-07 21:47:36,350 INFO] Validation perplexity: 6.25632
[2018-11-07 21:47:36,351 INFO] Validation accuracy: 62.3635
[2018-11-07 21:48:45,551 INFO] >> BLEU = 23.28, 55.7/29.4/17.9/11.0 (BP=0.976, ratio=0.977, hyp_len=10099, ref_len=10342)
[2018-11-07 21:48:45,555 INFO] Current language pair: ('fr', 'cs')
[2018-11-07 21:48:45,581 INFO] Loading valid dataset from all_pairs_preprocessed/fr-cs/data.valid.1.pt, number of examples: 1014

[2018-11-07 21:48:46,478 INFO] Validation perplexity: 5.42896
[2018-11-07 21:48:46,478 INFO] Validation accuracy: 64.3625
Traceback (most recent call last):
  File "../train.py", line 40, in <module>
    main(opt)
  File "../train.py", line 27, in main
    single_main(opt)
  File "/lnet/spec/work/people/machacek/neural-interlingua/OpenNMT-py-dm/onmt/train_single.py", line 239, in main
    opt.valid_steps)
  File "/lnet/spec/work/people/machacek/neural-interlingua/OpenNMT-py-dm/onmt/trainer.py", line 271, in train
    parser = argparse.ArgumentParser(prog = 'translate.py',
  File "/lnet/spec/work/people/machacek/neural-interlingua/OpenNMT-py-dm/onmt/translate/translator.py", line 43, in build_translator
    model = onmt.model_builder.load_test_multitask_model(opt)
  File "/lnet/spec/work/people/machacek/neural-interlingua/OpenNMT-py-dm/onmt/model_builder.py", line 133, in load_test_multitask_model
    map_location=lambda storage, loc: storage)
  File "/lnet/spec/work/people/machacek/neural-interlingua/p2-onmt/local/lib/python2.7/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/lnet/spec/work/people/machacek/neural-interlingua/p2-onmt/local/lib/python2.7/site-packages/torch/serialization.py", line 542, in _load
    result = unpickler.load()
  File "/lnet/spec/work/people/machacek/neural-interlingua/p2-onmt/local/lib/python2.7/site-packages/torch/serialization.py", line 508, in persistent_load
    data_type(size), location)
RuntimeError: $ Torch: not enough memory: you tried to allocate 0GB. Buy new RAM! at /pytorch/aten/src/TH/THGeneral.cpp:204

The dataset is multi30k, so no big file is loaded into memory. I observed this error twice, once after 15 hours of training with 12 src-tgt training pairs, and once after 8 hours with 10 pairs. I don't have any comparable error-free run.

Most probably no other process was using the same machine at the same time, but I can't be sure.

Any suggestions or ideas about what is happening and how to fix it?
My only idea is to merge the newest master from OpenNMT-py and hope that it's already fixed there.
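
The traceback shows the failure inside torch.load while build_translator reloads the saved model for the validation BLEU pass, so one plausible explanation is that host RAM slowly fills up as the checkpoint is reloaded at every validation step, rather than a single oversized allocation. A minimal sketch to test that hypothesis outside of training (the checkpoint path and iteration count are hypothetical, and psutil is an extra dependency):

import gc
import psutil
import torch

CHECKPOINT = "model_step_10000.pt"   # hypothetical path to a saved checkpoint

proc = psutil.Process()
for i in range(20):
    ckpt = torch.load(CHECKPOINT, map_location="cpu")   # same call path as the validation reload
    del ckpt
    gc.collect()
    print("iteration %d: RSS %.1f MB" % (i, proc.memory_info().rss / 1024.0 ** 2))

If the resident memory keeps climbing across iterations, the out-of-memory error after hours of training is consistent with a slow host-memory leak; if it stays flat, the problem is more likely a one-off allocation or another process on the machine.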
