
mrasp2's Issues

what the "meta.ras_dict" in config is?

The last line of "meta" in the example config says there should be a JSON file called "data/lang150/dicts/id_dict_1.json".
Do we have this file in the repo?
Or an example of what it should contain?
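Not an official answer, but here is a hedged sketch of one plausible layout, inferred from how RAS (random aligned substitution) is described in the paper: a JSON object mapping a token to candidate substitutions in other languages. The actual schema of id_dict_1.json may well differ.

    # Purely illustrative guess at the RAS dictionary layout; the real
    # schema of data/lang150/dicts/id_dict_1.json may differ.
    import json

    ras_dict = {"hello": ["bonjour", "hola", "ciao"]}  # hypothetical entries
    with open("id_dict_1.json", "w", encoding="utf-8") as f:
        json.dump(ras_dict, f, ensure_ascii=False, indent=2)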

Clarifications in training config

Hello

Thank you for your excellent work, and well-documented repo. I am trying to use your code to train a new model from scratch, and require some clarification on certain parts that are unclear to me, especially regarding the config.

(Please note I am referring to this config on the new_impl branch as an example of how I could create my own)

  1. data (under meta): What does this refer to? Is this the directory that contains binarized versions of one multilingual parallel dataset made by concatenating datasets from several language pairs (e.g. en-es, en-fr, en-it), or does it contain language pair-specific binary files in its subdirectories?

  2. I can see that in load_config.sh, variables starting with meta_ are not written to the options variable, and both monolingual and parallel data are provided separately in train_multilingual_w_mono.sh. This seems to suggest that paths are expected in the form data_1, data_2, etc. If so, could you please confirm what these paths refer to? I.e., how does data_1 differ from data_2?

  3. What is mono_dae? It is referred to repeatedly, at various places in the codebase. Would I need to set mono_key in the config file to mono_dae?

  4. Lastly, I have parallel and monolingual datasets that I have already preprocessed (with RAS substitution and language token prefixes). Would I need to set variables like langtoks, encoder_langtok and decoder_langtok?

Hope I can receive some assistance on this issue soon. Thanks!

An error occurred during model inference

I am using the new version. An error occurred at the model inference stage. The model is 12e12d_no_mono.pt, the data is the test-set binary file you provided, and the other configuration files were set up following examples/configs/biginfer.

      File "XXX/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 420, in _upgrade_state_dict
        state["args"].task = "translation"
    AttributeError: 'NoneType' object has no attribute 'task'
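This error means state["args"] is None when fairseq tries to upgrade the checkpoint. A minimal diagnostic sketch, assuming the checkpoint instead carries a newer Hydra-style "cfg" entry (an assumption about the file, not a confirmed fix), would be to inspect it directly:

    import torch

    # Inspect the raw checkpoint; a None "args" together with a "cfg" entry
    # usually points at a fairseq version mismatch (assumption, not confirmed).
    state = torch.load("12e12d_no_mono.pt", map_location="cpu")
    print(state.get("args"))   # None here triggers the AttributeError above
    print("cfg" in state)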

dataset

What does the paper mean by unidirectional (monolingual) corpora? My understanding is that a bidirectional corpus consists of sentence pairs with the same meaning, such as <English, Chinese>; but a monolingual corpus has no corresponding labels, so how are the cross-entropy loss and the contrastive loss computed? Also, the code implementation shows the data as <String, Coding> pairs, which differs from my understanding; I have thought about it for a long time and cannot figure out how paired data gets processed into this form.

How can we fine-tune the model?

It seems there is no introduction on how to fine-tune the pre-trained model. Could you give me some instructions?

fairseq-train config question

In the example config, what does min_lr stand for? When I run the command provided in the README, it gives me the following error:
[error screenshot not preserved in this mirror]
I checked the fairseq documentation and found this:
[documentation screenshot not preserved in this mirror]
Should I use this argument instead?
Thank you for answering!

Gather all of the outputs from different nodes and then compute contrastive loss

Hi,

this is a really great paper. I have a question about the calculation of the contrastive loss.

In the paper, you "use 8 × 4 NVIDIA V100 with update frequency 50 to train the models and each batch contains about 3 million tokens". Do you compute the contrastive loss within a mini-batch on a GPU or on all GPUs?

For conventional contrastive loss computation, we need to gather the outputs from the models on the different GPUs and then compute the loss. But I cannot find code for that in your repo. So do you compute it only with the outputs on a single GPU?

If I have misunderstood your code, could you point out where you gather the outputs? Normally one uses something like torch.distributed.all_gather().
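For context, the conventional recipe the question refers to looks roughly like the sketch below. This is the generic pattern, not code from the mRASP2 repo, and it assumes torch.distributed has already been initialized:

    import torch
    import torch.distributed as dist

    def gather_features(feats):
        """All-gather per-GPU sentence features so the contrastive loss can
        draw negatives from every worker, not just the local mini-batch."""
        gathered = [torch.zeros_like(feats) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, feats)   # gradients do not flow through remote copies
        gathered[dist.get_rank()] = feats  # keep the autograd graph for the local rank
        return torch.cat(gathered, dim=0)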

pre-norm or post-norm

Hello! The paper mentions that you used pre-norm during training, but the released code still appears to use the post-norm setting. Is the pre-norm setting configured somewhere else?

About swap sample

Hi Xiao,

I have a question about the swap_sample function in label_smoothed_cross_entropy_with_contrastive.py


Here, after swapping the sample, the new src_tokens are the same as the original target tokens.

However, the padding sides for source and target are different: src uses left padding while tgt uses right padding (see details below).
https://github.com/facebookresearch/fairseq/blob/a0ceabc287e26f64517fadb13a54c83b71e8e469/fairseq/tasks/translation.py#L200

So why not left-pad the new source tokens (the old target tokens), which originally carry right padding?
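For illustration, here is a hypothetical helper that converts right-padded token batches to left padding; the function name and usage are made up for this sketch, not taken from the repo:

    import torch

    def right_to_left_pad(tokens, pad_idx):
        """Move each row's padding from the right side to the left side."""
        out = tokens.new_full(tokens.size(), pad_idx)
        for i, row in enumerate(tokens):
            nonpad = row[row != pad_idx]                       # strip right padding
            out[i, tokens.size(1) - nonpad.numel():] = nonpad  # re-insert so padding sits on the left
        return out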

src tokens fed twice into the encoder?

Hi,

In the criterion script that constructs the contrastive loss, one line is:

net_output = model(**sample["net_input"])

for this, [src tokens -> encoder] -> decoder -> output

another line:

encoder_out = model.encoder.forward(sample["net_input"]["src_tokens"], sample["net_input"]["src_lengths"]).encoder_out

for this, src tokens -> encoder -> encoder output, which is the same as the part in [ ] above

It seems the src tokens are fed into the encoder twice.
Although the loss computation will still be correct,
won't this reduce training efficiency?

Or is there something I missed?
Thank you in advance.
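For reference, a minimal sketch of running the encoder once and reusing it, assuming the standard fairseq encoder/decoder interface (model and sample are as in the snippets above; whether this drops into the repo's criterion unchanged is an assumption):

    # Run the encoder once, then decode from the cached encoder states
    # instead of calling model(**sample["net_input"]) a second time.
    encoder_out = model.encoder(
        sample["net_input"]["src_tokens"],
        src_lengths=sample["net_input"]["src_lengths"],
    )
    net_output = model.decoder(
        sample["net_input"]["prev_output_tokens"],
        encoder_out=encoder_out,
    )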

Question on WMT16 en-ro

On the WMT16 en->ro benchmark, the result reported on this website (28.7) is quite different from the one reported in your paper (38.0). Is it possible for you to release your BPE-tokenized WMT16 en-ro test set? I am trying to reproduce your results on this benchmark but cannot achieve comparable performance.

Thanks a lot!

Release of synonym dictionary

Hi, this is really a great paper. In the paper, you said you would release the synonym dictionary. May I ask when you will release it? In addition, is it a multilingual synonym dictionary? Do you have a monolingual synonym dictionary, e.g. one only for English?

Fine-tuning mRASP2 on the zh-en and en-zh directions

Hello, what learning rate and optimizer hyperparameters did you use for the WMT17 zh-en and en-zh directions? Did you fine-tune using only the WMT17 Chinese-English parallel corpus? How many samples does the WMT17 Chinese-English training set contain in total? Roughly how many epochs did it take to converge?
Looking forward to your reply!

Question about the dataset

Congratulations on your interesting work!

Is the dataset introduced in the original implementation different from that in your new implementation?

BTW, is it possible for you to release the monolingual dataset and the various test sets in the tokenized raw-text format used in your original implementation?

Thanks!

Error downloading dataset

Hello, our team values your work very much, but we ran into some problems during replication: after changing the domain name in the download.sh file, the download fails with a 404 error. Could you look into this at your convenience? We would greatly appreciate it.

Project dependencies may have API risk issues

Hi, in mRASP2, inappropriate dependency version constraints can introduce risks.

Below are the dependencies and version constraints that the project is using

subword-nmt
sacrebleu
sacremoses
kytea
six

The version constraint == introduces a risk of dependency conflicts because the dependency scope is too strict.
A version constraint with no upper bound, or *, introduces a risk of missing-API errors because the latest version of a dependency may remove some APIs.

After further analysis, in this project,
The version constraint of dependency sacrebleu can be changed to >=1.1.0,<=1.1.1.
The version constraint of dependency sacrebleu can be changed to >=1.1.3,<=1.4.5.

The above suggestions can reduce dependency conflicts as much as possible,
while allowing the latest versions that do not trigger API-call errors in the project.

The invocation of the current project includes all the following methods.

The calling methods from the sacrebleu
sacrebleu.corpus_bleu
sacrebleu.compute_bleu
The calling methods from the all methods
get_hypo_and_ref
numpy.array
counts.append
fairseq.utils.strip_pad
max
all_dataset_upsample_ratio.strip
fairseq.data.PrependTokenDataset
self.swap_sample
FileNotFoundError
self.tgt_dict.string
model.encoder.forward.transpose
torch.no_grad
inspect.getfullargspec
log.get
float
tqdm.tqdm
hyps.append
json.loads
torch.cat
f.read.split
self.temperature.anchor_dot_contrast.torch.div.nn.LogSoftmax.diag
cls.load_dictionary.index
eval.readlines
self.dataset.size
self.temperature.anchor_dot_contrast.torch.div.nn.LogSoftmax.diag.sum
isinstance
torch.LongTensor
_sentence_embedding
self.inference_step
fairseq.criterions.label_smoothed_cross_entropy.LabelSmoothedCrossEntropyCriterion.add_args
open.close
mask.float
cls
recover_bpe
open
hasattr
id_num.score_dict.append
self.dataset.prefetch
fairseq.data.TruncateDataset
eval.read
numpy.array.sum
src_list.append
fairseq.models.transformer.transformer_wmt_en_de_big_t2t
self.tgt_dict.pad
eval
toks.int
src_datasets.append
format
bpe_symbol.line.replace.rstrip
cls.load_dictionary.eos
fairseq.data.data_utils.infer_language_pair
torch.cat.contiguous
logging.getLogger.info
argparse.Namespace
fairseq.models.register_model_architecture
j.line.split
similarity_function
str
Exception
ValueError
self.set_epoch
open.write
mask.float.sum.unsqueeze
fairseq.data.AppendTokenDataset
fairseq.data.encoders.build_tokenizer
super.set_epoch
super.__init__
cls.load_dictionary.unk
size_ratio.dataset.len.np.ceil.astype
super
self.padding_idx.src_tokens.int.sum
numpy.argsort
super.reduce_metrics
self.padding_idx.src_tokens.int
itertools.count
os.path.join
self.padding_idx.target.int
super.build_model
generator.generate
super.valid_step
round
int
len
fairseq.data.indexed_dataset.dataset_exists
refs.append
os.path.dirname
torch.nn.LogSoftmax
toks.int.cpu
logging.getLogger
re.compile
mask.unsqueeze
self.tokenizer.decode
numpy.ceil
remove_bpe_fn
fairseq.tasks.register_task
fairseq.tasks.translation.TranslationTask.add_args
re.search.span
torch.nn.CosineSimilarity
self.dataset.num_tokens
totals.append
fairseq.utils.deprecation_warning
self.compute_loss
cls.load_dictionary
self.target_dictionary.index
prefix_tokens.to.to
split_exists
fairseq.utils.eval_bool
remove_bpe
torch.transpose
self.len.np.random.permutation.astype
getattr
fairseq.tasks.translation.load_langpair_dataset
torch.div
re.search
target.contiguous
sum_logs
fairseq.metrics.log_scalar
self.padding_idx.target.int.sum
contrast_feature.expand
numpy.random.permutation
tgt_list.append
self.dataset.__getitem__
numpy.random.RandomState
cls.load_dictionary.bos
src_tokens.size
numpy.random.RandomState.choice
load_langpair_dataset
bpe_symbol.line.replace.rstrip.replace
fairseq.data.data_utils.load_indexed_dataset
cls.load_dictionary.pad
sacrebleu.compute_bleu
fairseq.options.eval_bool
mask.float.sum
map
self.get_contrastive_loss
fairseq.data.StripTokenDataset
self.build_generator
fairseq.utils.split_paths
fairseq.data.ConcatDataset
fairseq.metrics.log_derived
decode
data.SubsampleLanguagePairDataset
model
join
parser.add_argument
id_num.hypothesis_dict.append
tgt_datasets.append
math.log
fairseq.data.plasma_utils.PlasmaArray
prefix_tokens.to.expand
self.similarity_function
all_dataset_upsample_ratio.strip.split
fairseq.data.LanguagePairDataset
id_num.pos_score_dict.append
mask.unsqueeze.encoder_output.sum
numpy.arange
fairseq.utils.item
o.write
sacrebleu.corpus_bleu
reprocess
sample.size
re.search.group
fairseq.models.transformer.transformer_wmt_en_de
fairseq.criterions.register_criterion
self._inference_with_bleu
mono_datas.append
range
anchor_feature.expand
prefix_tokens.torch.LongTensor.unsqueeze
sum
model.encoder.forward

@developer
Could you please help me check this issue?
May I open a pull request to fix it?
Thank you very much.

Broken dataset links

Hi, I find that some links to the datasets seem to be broken, reporting the error "upstream server error". Are there new links available? Thanks!

Where can I get trained models?

Hi, I'm very interested in your work and want to run additional experiments with the model.

Where can I get the trained one?

Thank you for your great work!

Inference Error

Hey, this is really great work. But I ran into a problem when using the model for inference.

You have released three models: 6e6d-no-mono, 12e12d-no-mono and 12e12d.
I tried using 12e12d-no-mono and 12e12d to translate Hindi to English, and ran into this problem: sometimes 12e12d cannot decode tokens correctly while 12e12d-no-mono can. Below are my test samples and the tokens predicted by each model:

model: 12e12d

S-6 LANG_TOK_HI इस समय आ@@ ठ अं@@ को के साथ इ@@ ट@@ ली पू@@ ल C में ती@@ स@@ रे नं@@ बर पर हैं और इ@@ ट@@ ली को 29 सि@@ तं@@ बर को स@@ ्@@ कॉ@@ ट@@ ल@@ ै@@ ंड के खि@@ ला@@ फ@@ ़ दूस@@ रे मै@@ च में कड@@ ़@@ ी ट@@ क@@ ्@@ कर मि@@ ली ।
H-6 -0.6864292621612549 LANG_TOK_EN Ital@@ y is now on the th@@ ir@@ d spo@@ t in Po@@ ol C with eig@@ ht points and Ital@@ y fo@@ und a tie on September 29 against Sc@@ ot@@ land in a sec@@ ond mat@@ ch .
S-7 LANG_TOK_HI न@@ ्@@ यू@@ ज@@ ़@@ ी@@ ल@@ ै@@ ंड ग@@ ्@@ रु@@ प में प@@ ्@@ रथम श@@ ्@@ रे@@ णी पर , स@@ ्@@ कॉ@@ ट@@ ल@@ ै@@ ंड से 10 प@@ ॉ@@ इं@@ ट से आ@@ गे रहा ।
H-7 -0.6589236855506897 ् न ् यू@@ जी@@ ल@@ ै@@ ंड सम@@ ू@@ ह में पहले श ् रे@@ णी पर , स ् कॉ@@ ट@@ ल@@ ै@@ ंड से 10 प@@ ॉ@@ इं@@ ट से आ@@ गे रहा ।

model: 12e12d-no-mono

S-6 LANG_TOK_HI इस समय आ@@ ठ अं@@ को के साथ इ@@ ट@@ ली पू@@ ल C में ती@@ स@@ रे नं@@ बर पर हैं और इ@@ ट@@ ली को 29 सि@@ तं@@ बर को स@@ ्@@ कॉ@@ ट@@ ल@@ ै@@ ंड के खि@@ ला@@ फ@@ ़ दूस@@ रे मै@@ च में कड@@ ़@@ ी ट@@ क@@ ्@@ कर मि@@ ली ।
H-6 -0.5951337218284607 LANG_TOK_EN Ital@@ y is cur@@ rent@@ ly th@@ ir@@ d in Po@@ ol C with eig@@ ht points and scor@@ ed a tie against Sc@@ ot@@ land in the sec@@ ond mat@@ ch on September 29 .
S-7 LANG_TOK_HI न@@ ्@@ यू@@ ज@@ ़@@ ी@@ ल@@ ै@@ ंड ग@@ ्@@ रु@@ प में प@@ ्@@ रथम श@@ ्@@ रे@@ णी पर , स@@ ्@@ कॉ@@ ट@@ ल@@ ै@@ ंड से 10 प@@ ॉ@@ इं@@ ट से आ@@ गे रहा ।
H-7 -0.6146384477615356 LANG_TOK_EN In the New Ze@@ al@@ and gro@@ up , it was 10 points a@@ head of Sc@@ ot@@ land in the first clas@@ s .

The following are my scripts (the shell command substitution around echo was lost in this mirror's rendering and has been restored as $(...)):

model: 12e12d

    fairseq-generate ./test_data/bin \
        --user-dir ./mcolt \
        -s hi -t en \
        --path ./model/12e12d_last.pt \
        --max-tokens 1024 \
        --task translation_w_langtok \
        --lang-prefix-tok "LANG_TOK_"$(echo "en " | tr '[a-z]' '[A-Z]') \
        --max-source-positions 1024 \
        --max-target-positions 1024 \
        --nbest 1 | grep -E '[S|H|P|T]-[0-9]+' > ./test_data/trans_res/en_12e12d_last.txt

model: 12e12d-no-mono

    fairseq-generate ./test_data/bin \
        --user-dir ./mcolt \
        -s hi -t en \
        --path ./model/12e12d_no_mono.pt \
        --max-tokens 1024 \
        --task translation_w_langtok \
        --lang-prefix-tok "LANG_TOK_"$(echo "en " | tr '[a-z]' '[A-Z]') \
        --max-source-positions 1024 \
        --max-target-positions 1024 \
        --nbest 1 | grep -E '[S|H|P|T]-[0-9]+' > ./test_data/trans_res/en_12e12d_no_mono.txt

It can be seen that the tokens the two models predict for H-7 are completely inconsistent. The first position should be LANG_TOK_EN, but the model decoded it to ् instead. Moreover, the tokens that follow are neither fully source-language tokens nor fully target-language tokens. In my test set there are other sentences that decode into the same situation, and their first token is also ्.

Why does this happen? Did I not pass the parameters that 12e12d expects?
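One hedged way to sanity-check such a setup, assuming the fairseq Dictionary API and a hypothetical path for the released vocabulary file: if the prefix token were missing from the dictionary it would map to the unknown index, which could explain the decoder drifting off-target.

    from fairseq.data import Dictionary

    # Hypothetical dictionary path; substitute the vocabulary shipped with the model.
    d = Dictionary.load("vocab.bpe.32000")
    print(d.index("LANG_TOK_EN") == d.unk())  # True would mean the prefix token is unknown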

Dataset for evaluation

Hi there,
I have been trying to use the mRASP2 model for evaluation and have run into a couple of issues.

  1. The sample yaml file for evaluation has the following format:

         data_testset_1:
           direction: en2de
           name: wmt14
           path: data/binarized/en_de/en2de/wmt14
           ref: data/dev/en2de/wmt14

What are path and ref referring to? How do we get the binarized version? Is there a script I can follow, or a link to the dataset that was used?

Additionally, which fairseq model is used when evaluating?

Where is self.compute_loss() defined?

In the Python file label_smoothed_cross_entropy_with_contrastive.py:

    def forward(self, model, sample, reduce=True):
        net_output = model(**sample["net_input"])
        loss, nll_loss = self.compute_loss(model, net_output, sample, reduce=reduce)
        ...

Where is self.compute_loss() defined?
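For what it's worth, compute_loss does not appear in that file because it is inherited: the criterion appears to subclass fairseq's label-smoothed cross-entropy criterion (the call to LabelSmoothedCrossEntropyCriterion.add_args in the dependency issue's method list above points the same way). A rough sketch of the relationship, with a hypothetical subclass name:

    from fairseq.criterions.label_smoothed_cross_entropy import (
        LabelSmoothedCrossEntropyCriterion,  # defines compute_loss()
    )

    # Hypothetical class name for illustration; forward() then calls the
    # inherited self.compute_loss() even though it is not defined locally.
    class LabelSmoothedCrossEntropyWithContrastive(LabelSmoothedCrossEntropyCriterion):
        ...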

loss average

Hi, in this line:

    loss = -nn.LogSoftmax(0)(torch.div(anchor_dot_contrast, self.temperature)).diag().sum()

why is it a sum rather than a mean? Does the fairseq library automatically average within a batch? Sorry, I am not familiar with this framework. I also notice that the reduction in compute_loss is a sum:
https://github.com/pytorch/fairseq/blob/14c5bd027f04aae9dbb32f1bd7b34591b61af97f/fairseq/criterions/label_smoothed_cross_entropy.py#L46
and ntokens / nsentences is the average number of tokens per sentence within a batch, right?

    all_loss = loss + contrastive_loss * self.contrastive_lambda * ntokens / nsentences

Could you please share what the loss looks like in the early training stage? According to my empirical experiments, even without multiplying contrastive_loss by ntokens / nsentences, it is already of the same order of magnitude. Thanks so much!
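For reference, here is a self-contained toy version of the quoted loss line, with made-up shapes, showing why the result is a per-sentence sum rather than a mean:

    import torch
    import torch.nn as nn

    # Hypothetical per-GPU batch: 4 sentence pairs, feature dimension 8.
    src_feats = torch.randn(4, 8)
    tgt_feats = torch.randn(4, 8)
    sim = nn.CosineSimilarity(dim=-1)
    anchor_dot_contrast = sim(src_feats.unsqueeze(1), tgt_feats.unsqueeze(0))  # (4, 4)
    temperature = 0.1
    # LogSoftmax over dim 0 normalizes each column against all candidate sources;
    # .diag() keeps only the aligned (positive) pairs, and .sum() adds the
    # per-sentence terms, so the value scales with nsentences instead of averaging.
    contrastive_loss = -nn.LogSoftmax(0)(torch.div(anchor_dot_contrast, temperature)).diag().sum()
    print(contrastive_loss)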
