Can you try adding the option --normalize_embeddings center to your python unsupervised.py command and see if that helps?
Thank you @glample, I am trying it.
I have some questions about the parameters:

--dis_dropout: Discriminator dropout.

For this parameter, I only found something related to --dis_input_dropout, namely: "As a result, we only feed the discriminator with the 50,000 most frequent words." But I did not find any clue about --dis_dropout.

--dis_lambda: Discriminator loss feedback coefficient.
--dis_clip_weights: Clip discriminator weights (0 to disable)

I could not work out what these two parameters do from the paper.

--dico_max_rank: Maximum dictionary words rank (0 to disable)
--dico_min_size: Minimum generated dictionary size (0 to disable)
--dico_max_size: Maximum generated dictionary size (0 to disable)

As I understand it, the generated dictionary is used for validation, so dico_min_size and dico_max_size specify its size. But what is dico_max_rank for?
Thank you @glample
Specifying --normalize_embeddings center did not help; I still get results like:
INFO - 04/25/18 11:06:52 - 0:50:04 - 1500 source words - csls_knn_10 - Precision at k = 1: 0.000000
INFO - 04/25/18 11:06:53 - 0:50:04 - 1500 source words - csls_knn_10 - Precision at k = 5: 0.000000
INFO - 04/25/18 11:06:53 - 0:50:04 - 1500 source words - csls_knn_10 - Precision at k = 10: 0.000000
Mmm I think the problem is that the default epoch size we set in the code is too big. I forgot what we used in the paper exactly, but I remember we used a quite small epoch size (and consequently more epochs). This way, the evaluations are more frequent, and you are more likely to get a good model. I just tried the command:
CUDA_VISIBLE_DEVICES=0 python unsupervised.py --src_lang en --tgt_lang zh --src_emb data/wiki.en.vec --tgt_emb data/wiki.zh.vec --n_refinement 10 --n_epochs 10 --epoch_size 250000 --normalize_embeddings center
and after half an hour I get ~33% accuracy P@1. I guess an even smaller epoch size like 100000 might be even better. I just noticed that there is a small bug when saving the embeddings while using center for --normalize_embeddings; I'll fix this tomorrow, sorry about that.
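For reference, the center normalization simply subtracts the per-dimension mean vector from every embedding. A minimal pure-Python sketch of the idea (MUSE itself does this on PyTorch tensors, and reuses the stored mean at export time):

```python
def center_embeddings(emb):
    """Center a list of embedding vectors: subtract the per-dimension mean.

    emb: list of equal-length lists of floats (rows = words, cols = dims).
    Returns (centered_emb, mean) so the same mean can be reused later,
    e.g. when exporting embeddings normalized the same way.
    Pure-Python illustration, not MUSE's actual implementation.
    """
    n, dim = len(emb), len(emb[0])
    mean = [sum(row[d] for row in emb) / n for d in range(dim)]
    centered = [[row[d] - mean[d] for d in range(dim)] for row in emb]
    return centered, mean

# Tiny example: two 2-d vectors whose mean is (2.0, 3.0).
vecs = [[1.0, 2.0], [3.0, 4.0]]
centered, mean = center_embeddings(vecs)
```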
Regarding --dico_max_rank, --dico_min_size and --dico_max_size: these are parameters of the synthetic dictionaries we build during the refinement steps. In particular:
--dico_max_rank 15000 # never consider a pair where either the source or the target word is outside the 15000 most frequent words
--dico_max_size 10000 # never keep more than 10000 pairs in total
--dico_min_size 1000 # always keep at least 1000 pairs (used in combination with --dico_threshold, which removes translations we are not confident about)
You can check all this in dico_builder.py.
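The interplay of the three flags can be sketched roughly like this (a simplified pure-Python illustration of the filtering logic, not the real dico_builder.py code; the pair scores and the --dico_threshold confidence cutoff are assumptions here):

```python
def filter_dictionary(pairs, max_rank=15000, max_size=10000,
                      min_size=1000, threshold=0.0):
    """Filter candidate (src_rank, tgt_rank, score) translation pairs.

    Simplified sketch of the constraints behind --dico_max_rank,
    --dico_max_size and --dico_min_size; not the real MUSE code.
    """
    # --dico_max_rank: drop pairs where either word falls outside the
    # max_rank most frequent words (0 would disable this filter).
    if max_rank:
        pairs = [p for p in pairs if p[0] < max_rank and p[1] < max_rank]

    # Sort by confidence score, best pairs first.
    pairs = sorted(pairs, key=lambda p: -p[2])

    # --dico_min_size: keep at least min_size pairs even if their score
    # is below the confidence threshold; beyond min_size, enforce it.
    kept = [p for i, p in enumerate(pairs)
            if i < min_size or p[2] >= threshold]

    # --dico_max_size: never keep more than max_size pairs in total.
    if max_size:
        kept = kept[:max_size]
    return kept

# Example: one pair is dropped by max_rank, one by the threshold.
pairs = [(1, 2, 0.9), (20000, 3, 0.95), (4, 5, 0.1), (6, 7, 0.8)]
result = filter_dictionary(pairs, max_rank=15000, max_size=2,
                           min_size=1, threshold=0.5)
```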
Regarding --dis_dropout, this is just the dropout between the discriminator layers.
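In other words, --dis_input_dropout acts on the input embeddings while --dis_dropout sits between the hidden layers. A schematic layer list (layer names, sizes, and defaults here are illustrative, not MUSE's exact model code):

```python
def discriminator_layers(emb_dim=300, dis_hid_dim=2048, dis_layers=2,
                         dis_input_dropout=0.1, dis_dropout=0.0):
    """Schematic layer sequence for an MLP discriminator.

    Returns (layer_name, value) tuples showing where the two dropout
    flags act; illustrative sketch only, not MUSE's actual architecture.
    """
    layers = [("dropout", dis_input_dropout)]    # --dis_input_dropout: on the inputs
    in_dim = emb_dim
    for _ in range(dis_layers):
        layers.append(("linear", (in_dim, dis_hid_dim)))
        layers.append(("leaky_relu", 0.2))
        layers.append(("dropout", dis_dropout))  # --dis_dropout: between hidden layers
        in_dim = dis_hid_dim
    layers.append(("linear", (in_dim, 1)))       # final real/fake score
    return layers
```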
Thank you @glample, training becomes normal with the latest code (downloaded 4/23/2018):
INFO - 04/25/18 23:42:52 - 3:13:13 - 1500 source words - csls_knn_10 - Precision at k = 1: 33.733333
INFO - 04/25/18 23:42:52 - 3:13:13 - 1500 source words - csls_knn_10 - Precision at k = 5: 53.533333
INFO - 04/25/18 23:42:52 - 3:13:13 - 1500 source words - csls_knn_10 - Precision at k = 10: 59.533333
But an error always occurs at the last step, when exporting the mapping with the best model:
Traceback (most recent call last):
File "unsupervised.py", line 184, in <module>
trainer.export()
File "/home/jack/software/MUSE/src/trainer.py", line 255, in export
normalize_embeddings(src_emb, params.normalize_embeddings, mean=params.src_mean)
File "/home/jack/software/MUSE/src/utils.py", line 419, in normalize_embeddings
emb.sub_(mean)
RuntimeError: inconsistent tensor size, expected r_ [332647 x 300], t [332647 x 300] and src [200000 x 300] to have the same number of elements, but got 99794100, 99794100 and 60000000 elements respectively at /pytorch/torch/lib/TH/generic/THTensorMath.c:1008
Is this the bug you mentioned above? Thank you!
Yes, this is the bug I mentioned :) This is fixed: a620cc8
Closing for now, feel free to reopen if you still face this issue.
I'm unable to reproduce the results above, even with the changes to epoch size and number of epochs.
I am running: python unsupervised.py --src_lang en --tgt_lang zh --src_emb wiki.en.vec --tgt_emb wiki.zh.vec --n_refinement 10 --n_epochs 10 --epoch_size 250000 --normalize_embeddings center
and after 10 epochs (before refinement) only getting the following results:
INFO - 03/25/19 21:23:48 - 0:27:10 - 2230 source words - nn - Precision at k = 1: 1.165919
INFO - 03/25/19 21:23:48 - 0:27:10 - 2230 source words - nn - Precision at k = 5: 2.780269
INFO - 03/25/19 21:23:48 - 0:27:10 - 2230 source words - nn - Precision at k = 10: 3.587444
INFO - 03/25/19 21:23:48 - 0:27:10 - Found 2230 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 03/25/19 21:24:10 - 0:27:31 - 2230 source words - csls_knn_10 - Precision at k = 1: 1.121076
INFO - 03/25/19 21:24:10 - 0:27:31 - 2230 source words - csls_knn_10 - Precision at k = 5: 3.228700
INFO - 03/25/19 21:24:10 - 0:27:31 - 2230 source words - csls_knn_10 - Precision at k = 10: 4.843049
What's also interesting is that the mean cosine validation metric seems to decrease as the precision improves. The last epoch had the following value:
Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.38720
while the first epoch (with worse precision) had:
Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.62325
Any idea what's going on here?
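For context, the "Mean cosine" criterion averages the cosine similarity over the pairs of the generated dictionary; a simplified pure-Python sketch (the real criterion builds the pair list with CSLS, which is omitted here):

```python
import math

def mean_cosine(pairs):
    """Average cosine similarity over (src_vec, tgt_vec) pairs.

    Simplified version of the unsupervised validation criterion MUSE
    logs as "Mean cosine"; illustrative only.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)
    return sum(cos(u, v) for u, v in pairs) / len(pairs)

# One perfectly aligned pair (cos = 1) and one orthogonal pair (cos = 0).
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
score = mean_cosine(pairs)
```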