nyu-dl / dl4mt-multi
License: BSD 3-Clause "New" or "Revised" License
Hi,
I am trying to train a 2-encoder/1-decoder model, but the training process seems to get stuck right before the first iteration.
The computation graph seems to be built (ca. 3.5 GB allocated on gpu0) and the Blocks main loop is entered:
-------------------------------------------------------------------------------
BEFORE FIRST EPOCH
-------------------------------------------------------------------------------
Training status:
batch_interrupt_received: False
epoch_interrupt_received: False
epoch_started: True
epochs_done: 0
iterations_done: 0
received_first_batch: False
resumed_from: None
training_started: True
Log records from the iteration 0:
time_initialization: 3.81469726562e-06
But then nothing happens. The GPU shows 0% load while the main CPU process runs at 100%. I waited for an hour and also tested with very small training files. Has anybody seen behaviour like this? Any suggestions?
Thanks,
Marcin
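The status above reports received_first_batch: False, which suggests the main loop is still waiting for the data stream and has not reached the GPU yet. A minimal sketch for checking that in isolation, assuming the training stream can be iterated on its own; the helper name time_first_batches is hypothetical and not part of dl4mt-multi:

```python
import time


def time_first_batches(batches, n=3):
    """Return the seconds spent waiting for each of the first n items."""
    iterator = iter(batches)
    timings = []
    for _ in range(n):
        start = time.time()
        try:
            next(iterator)
        except StopIteration:
            # The iterable ran out before n items were produced.
            break
        timings.append(time.time() - start)
    return timings


# Stand-in data; for the real model this would be whatever iterator
# the training stream exposes (an assumption, not verified here).
print(time_first_batches([[1, 2], [3, 4]]))
```

If the first call never returns on the real stream, the hang is in data loading or preprocessing and not in the compiled graph.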
Hi!
I have two questions:
Hello,
I am using Python 2.7.6. While running 'python train_mlnmt.py' I got the following error:
File "train_mlnmt.py", line 27, in
train(config, get_tr_stream(config), get_dev_streams(config),
File "/home/development/jigar/jigar/dl4mt-multi/mcg/stream.py", line 214, in get_tr_stream
for k, v in config['src_vocabs'].iteritems()}
File "/home/development/jigar/jigar/dl4mt-multi/mcg/stream.py", line 214, in
for k, v in config['src_vocabs'].iteritems()}
AttributeError: 'module' object has no attribute 'OrderedDict'
I guess this error is specific to Python 2.6 and earlier.
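A quick way to test that guess is to ask the interpreter directly. The sketch below assumes the module named in the error is the standard collections module, which the traceback does not confirm; the function name describe_ordereddict_support is hypothetical:

```python
import collections
import sys


def describe_ordereddict_support():
    """Report the interpreter version and whether collections has OrderedDict."""
    return {
        "python": tuple(sys.version_info[:3]),
        "has_ordereddict": hasattr(collections, "OrderedDict"),
        # A path outside the standard library here would point to a local
        # file named collections.py shadowing the real module.
        "collections_file": getattr(collections, "__file__", None),
    }


print(describe_ordereddict_support())
```

If has_ordereddict is True under the same interpreter that runs train_mlnmt.py, the Python version is probably not the cause and the attribute is being looked up on some other module.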
Hi, thanks for sharing this great work.
I'm trying to train a multi-way model but have run into some issues with intermediate validation.
The first is that the code never executes BLEU validation.
I set the 'bleu_val_freq' and 'bleu_script' config options properly, but nothing happens; validation is just skipped.
The second is about log_prob_freq.
The log probability on the validation set is computed at every log_prob_freq interval I specified, but it fails with this error:
Traceback (most recent call last):
File "train_mlnmt.py", line 25, in
get_logprob_streams(config))
File "/home/pjh/dl4mt-multi/mlnmt.py", line 152, in train
main_loop.run()
File "/home/pjh/dl4mt-multi/mcg/algorithm.py", line 328, in run
reraise_as(e)
File "/home/pjh/codes/blocks/blocks/utils/init.py", line 225, in reraise_as
six.reraise(type(new_exc), new_exc, orig_exc_traceback)
File "/home/pjh/dl4mt-multi/mcg/algorithm.py", line 314, in run
while self._run_epoch():
File "/home/pjh/dl4mt-multi/mcg/algorithm.py", line 352, in _run_epoch
while self._run_iteration():
File "/home/pjh/dl4mt-multi/mcg/algorithm.py", line 374, in _run_iteration
self._run_extensions('after_batch', batch)
File "/home/pjh/dl4mt-multi/mcg/algorithm.py", line 382, in _run_extensions
extension.dispatch(CallbackName(method_name), *args)
File "/home/pjh/codes/blocks/blocks/extensions/init.py", line 328, in dispatch
self.do(callback_invoked, *(from_main_loop + tuple(arguments)))
File "/home/pjh/dl4mt-multi/mcg/extensions.py", line 375, in do
if numpy.isnan(numpy.mean(probs[cg_name])):
File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2885, in mean
out=out, keepdims=keepdims)
File "/usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py", line 72, in _mean
ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'list' and 'int'
Original exception:
TypeError: unsupported operand type(s) for /: 'list' and 'int'
It seems to be a numpy-related issue; my numpy version is 1.11.0.
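One way this message can arise is when the value passed to numpy.mean holds lists of different lengths: numpy then treats the elements as Python objects, summing them concatenates the lists, and the final division fails. That this is what probs[cg_name] contains is an assumption, not something the traceback proves. A minimal sketch of a workaround that flattens before averaging; the helper name mean_of_ragged is hypothetical:

```python
import numpy as np


def mean_of_ragged(values):
    """Average all numbers in a sequence of scalars or variable-length lists."""
    parts = [np.atleast_1d(np.asarray(v, dtype=float)).ravel() for v in values]
    if not parts:
        # Nothing to average; mirror numpy's convention of returning NaN.
        return float("nan")
    return float(np.concatenate(parts).mean())


# Two "batches" of log probabilities with different lengths.
print(mean_of_ragged([[1.0, 2.0], [3.0]]))
```

The NaN check in the extension could then be applied to the flattened mean, at the cost of weighting every entry equally regardless of which batch it came from.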
Other than these, sampling_freq, save_freq and the actual training work fine.
Any ideas or comments would be appreciated! FYI, I'm attaching my configuration.
config['normalized_bleu'] = True
config['track_n_models'] = 3
config['output_val_set'] = True
config['beam_size'] = 12
config['log_prob_freq'] = 20
config['lob_prob_bs'] = 10
config['reload'] = True
config['save_freq'] = 25000
config['sampling_freq'] = 2500
config['bleu_val_freq'] = 25
config['val_burn_in'] = 1
config['finish_after'] = 2000000
config['incremental_dump'] = True
Hello,
I start this model by issuing the command 'python train_mlnmt.py', which raises an attribute error. I am not sure whether this is due to the Python version. I am using Python 2.7.13.