helsinki-nlp / fotranmt
This project forked from opennmt/opennmt-py
Open Source Neural Machine Translation in PyTorch
Home Page: http://opennmt.net/
License: MIT License
Hello, I am getting this error in the middle of training. I have tried training several times, but the same error occurs at different steps. Tracing back the error, the issue appears to start at the validation steps.
@raganato
/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/THCTensorIndex.cu:360: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [5,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/THCTensorIndex.cu:360: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [5,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
.
.
.
Traceback (most recent call last):
File "train.py", line 40, in <module>
main(opt)
File "train.py", line 27, in main
single_main(opt)
File "/home/aki/OpenNMT-py/onmt/train_single.py", line 262, in main
opt.valid_steps)
File "/home/aki/OpenNMT-py/onmt/trainer.py", line 223, in train
report_stats)
File "/home/aki/OpenNMT-py/onmt/trainer.py", line 384, in _gradient_accumulation
dec_state)
File "/home/aki/anaconda3/envs/zeroshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/aki/OpenNMT-py/onmt/models/model.py", line 75, in forward
memory_lengths=lengths)
File "/home/aki/anaconda3/envs/zeroshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/aki/OpenNMT-py/onmt/decoders/decoder.py", line 139, in forward
tgt, memory_bank, state, memory_lengths=memory_lengths)
File "/home/aki/OpenNMT-py/onmt/decoders/decoder.py", line 350, in _run_forward_pass
memory_lengths=memory_lengths)
File "/home/aki/anaconda3/envs/zeroshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/aki/OpenNMT-py/onmt/modules/global_attention.py", line 181, in forward
mask = sequence_mask(memory_lengths, max_len=align.size(-1))
File "/home/aki/OpenNMT-py/onmt/utils/misc.py", line 23, in sequence_mask
.type_as(lengths)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generic/THCTensorCopy.cpp:21
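A note on reading the log above: the assertion `srcIndex < srcSelectDimSize` fires when an index-select receives an index that is not smaller than the size of the dimension it selects from. Because CUDA kernels run asynchronously, the Python frame shown in the traceback (`sequence_mask`) is not necessarily the operation that failed; setting `CUDA_LAUNCH_BLOCKING=1` makes launches synchronous so the traceback lines up with the failing call. One common cause, which is only an assumption here and not confirmed for this run, is a token id that is larger than the embedding table. A minimal sketch of such a range check follows; `out_of_range_ids`, `batch_ids` and `vocab_size` are hypothetical names and values, not part of OpenNMT-py.

```python
import os

# Has to be set before CUDA is initialised; makes kernel launches synchronous
# so the Python traceback points at the operation that actually failed.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def out_of_range_ids(token_ids, table_size):
    """Return the sorted distinct ids that cannot index a table with table_size rows."""
    return sorted({i for i in token_ids if i < 0 or i >= table_size})


# Hypothetical batch of token ids and a hypothetical vocabulary size.
batch_ids = [2, 17, 4999, 5000, 5003, 3]
vocab_size = 5000

bad_ids = out_of_range_ids(batch_ids, vocab_size)
print(bad_ids)  # [5000, 5003]
```

Running a check like this on the ids of each training and validation batch (flattened to plain integers) before they reach the model would show whether the data and the vocabulary the model was built with disagree.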
Hello,
I'm experimenting with my PR branch of neural-interlingua. After 8 hours of trouble-free training, this happened:
[2018-11-07 21:47:35,402 INFO] Current language pair: ('de', 'cs')
[2018-11-07 21:47:35,433 INFO] Loading valid dataset from all_pairs_preprocessed/de-cs/data.valid.1.pt, number of examples: 1014
[2018-11-07 21:47:36,350 INFO] Validation perplexity: 6.25632
[2018-11-07 21:47:36,351 INFO] Validation accuracy: 62.3635
[2018-11-07 21:48:45,551 INFO] >> BLEU = 23.28, 55.7/29.4/17.9/11.0 (BP=0.976, ratio=0.977, hyp_len=10099, ref_len=10342)
[2018-11-07 21:48:45,555 INFO] Current language pair: ('fr', 'cs')
[2018-11-07 21:48:45,581 INFO] Loading valid dataset from all_pairs_preprocessed/fr-cs/data.valid.1.pt, number of examples: 1014
[2018-11-07 21:48:46,478 INFO] Validation perplexity: 5.42896
[2018-11-07 21:48:46,478 INFO] Validation accuracy: 64.3625
Traceback (most recent call last):
File "../train.py", line 40, in <module>
main(opt)
File "../train.py", line 27, in main
single_main(opt)
File "/lnet/spec/work/people/machacek/neural-interlingua/OpenNMT-py-dm/onmt/train_single.py", line 239, in main
opt.valid_steps)
File "/lnet/spec/work/people/machacek/neural-interlingua/OpenNMT-py-dm/onmt/trainer.py", line 271, in train
parser = argparse.ArgumentParser(prog = 'translate.py',
File "/lnet/spec/work/people/machacek/neural-interlingua/OpenNMT-py-dm/onmt/translate/translator.py", line 43, in build_translator
model = onmt.model_builder.load_test_multitask_model(opt)
File "/lnet/spec/work/people/machacek/neural-interlingua/OpenNMT-py-dm/onmt/model_builder.py", line 133, in load_test_multitask_model
map_location=lambda storage, loc: storage)
File "/lnet/spec/work/people/machacek/neural-interlingua/p2-onmt/local/lib/python2.7/site-packages/torch/serialization.py", line 358, in load
return _load(f, map_location, pickle_module)
File "/lnet/spec/work/people/machacek/neural-interlingua/p2-onmt/local/lib/python2.7/site-packages/torch/serialization.py", line 542, in _load
result = unpickler.load()
File "/lnet/spec/work/people/machacek/neural-interlingua/p2-onmt/local/lib/python2.7/site-packages/torch/serialization.py", line 508, in persistent_load
data_type(size), location)
RuntimeError: $ Torch: not enough memory: you tried to allocate 0GB. Buy new RAM! at /pytorch/aten/src/TH/THGeneral.cpp:204
The dataset is multi30k, so no big file is loaded into memory. I observed this error twice: once after 15 hours of training with 12 src-tgt training pairs, and once after 8 hours with 10 pairs. I don't have any comparable error-free run.
Most probably no other process was using the same machine at the same time, but I can't be sure.
Any suggestions or ideas about what is happening and how to fix it?
My only idea is to merge the newest master from OpenNMT-py and hope that it's already fixed there.
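The traceback shows the failure inside `torch.load`, reached through `build_translator` during a validation round, and the message refers to host RAM rather than GPU memory. One way to narrow this down, offered as a sketch and not as a diagnosis, is to log the peak resident memory of the training process around each validation round and see whether it keeps climbing. The helpers `peak_rss_mib` and `log_memory` below are hypothetical and use only the Python standard library on a Unix system; whether memory actually grows in this run cannot be told from the log above.

```python
from __future__ import print_function  # keeps the sketch usable on Python 2.7

import resource
import sys


def peak_rss_mib():
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / float(divisor)


def log_memory(tag, log=print):
    """Log the peak RSS with a label and return the value."""
    value = peak_rss_mib()
    log("[mem] %s: peak RSS %.1f MiB" % (tag, value))
    return value


before = log_memory("before validation")
# ... the validation and BLEU translation step would run here ...
after = log_memory("after validation")
```

If the logged value rises by roughly one model's size at every validation round, that would suggest the models loaded for translation are not being released between rounds.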