Just a heads up: the absolute scores may be less meaningful here, because different models can produce scores in different ranges. Ideally, you would show that the score correlates with human judgment, but unfortunately I don't know of any cross-lingual dataset for that.
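(For reference, such a validation usually means computing segment-level correlation between the metric and human ratings on the same pairs. A minimal sketch, using the current `bert_score` API and made-up human ratings; the data here is purely illustrative:)

```python
from scipy.stats import pearsonr
from bert_score import score

cands = ['hello how are you?', 'the cat sat on the mat', 'a small house']
refs = ['hi, how are you doing?', 'a cat was sitting on the mat', 'a tiny home']
human_scores = [4.5, 3.5, 4.0]  # hypothetical human adequacy ratings, aligned with the pairs

P, R, F1 = score(cands, refs, lang='en')
r, p = pearsonr(F1.tolist(), human_scores)  # segment-level correlation with humans
print(f'Pearson r = {r:.3f} (p = {p:.3g})')
```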
As I understand it, you don't have any more implementation questions, so I am closing this.
I am happy to chat about the potential research opportunities for cross-lingual scores. Feel free to continue the conversation under this issue, or contact us directly via email if you want to keep it private.
We just saw https://github.com/facebookresearch/XLM, an implementation of a cross-lingual language model based on BERT.
It seems that the XNLI-15 model could be a nice first solution:
XNLI-15 is the model used in the paper for XNLI fine-tuning.
It handles English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu.
It is trained with the MLM and the TLM objectives.
For this model we used a different preprocessing than for the MT models (such as lowercasing and accent removal).
The repository includes an example of how it works.
Could you consider implementing this in your library?
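(A minimal sketch of the preprocessing mentioned above, i.e. lowercasing plus accent removal via Unicode decomposition; the exact XLM pipeline may differ:)

```python
import unicodedata

def preprocess(text):
    """Lowercase and strip accents, e.g. 'Como estás?' -> 'como estas?'."""
    text = text.lower()
    # NFD splits each accented character into base character + combining mark
    decomposed = unicodedata.normalize('NFD', text)
    # drop the combining marks (Unicode category 'Mn')
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
```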
Thank you @shoegazerstella for letting us know. We are definitely going to look into it but it may take some time before we get back to you.
If this is really important to your research, I encourage you to fork the repo and start implementing it. The general backend of BERTScore is at https://github.com/Tiiiger/bert_score/blob/master/bert_score/utils.py. Please let me know if you have any questions.
Hope the added docs can help you.
Hello,
We also conjecture that this is possible, although we have not done a proper study of this hypothesis.
Hi @Tiiiger,
What would you suggest if we want to verify whether this hypothesis is valid? Is it possible to use an existing model, or would a new BERT model need to be trained?
I was trying this:
from bert_score import score

cands = ['hello how are you?', 'cat', 'house']
refs = ['ciao come stai?', 'gatto', 'topo']  # 'topo' means 'mouse', so the last pair is a deliberate mismatch
P, R, F1 = score(cands, refs, bert="bert-base-multilingual-cased", verbose=True)
P: tensor([0.6788, 0.6708, 0.7756])
R: tensor([0.7002, 0.6582, 0.7756])
F1: tensor([0.6893, 0.6644, 0.7756])
And the similarity matrix plot is this: (plot not shown)
Clearly this is not working: the last words in the two lists, house/topo, are not the same word, yet they get the highest similarity score.
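(The similarity matrix can be reproduced with the library's `plot_example` helper; the keyword names below are from the current API and may differ in older releases, so treat this as a sketch:)

```python
from bert_score import plot_example

# Visualize token-level cosine similarities for one candidate/reference pair.
plot_example('house', 'topo', model_type='bert-base-multilingual-cased')
```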
I am trying to implement the solution discussed. You can find the code here.
Apologies if this is not the most elegant solution, but it was the fastest for me to test today.
So I am running `bert_score_test.py`, but I get stuck after the embedding generation; the prints below refer to `ref_stats` and `hyp_stats` in `bert_score/utils.py`:
calculating scores...
loading facebook-XLM model..
Loading vocabulary from XLM/models/vocab_xnli_15.txt ...
Read 4622450944 words (121332 unique) from vocabulary file.
Loading codes from XLM/models/codes_xnli_15.txt ...
Read 80000 codes from the codes file.
0%| | 0/1 [00:00<?, ?it/s]generating embeddings from facebook-XLM model..
{'es': 'hola como estas?'}
8
torch.Size([8, 1, 1024])
torch.Size([1, 1024])
tensor([[[-0.0235, -1.1157, 5.5236, ..., -1.7445, 4.6693, -5.4893]],
[[-4.8977, -5.8174, -0.0425, ..., -4.0513, 1.5466, 1.0361]],
[[-3.0278, -2.7101, -6.6004, ..., -2.3234, 2.0516, 0.8349]],
...,
[[-3.4236, -3.7358, 1.5622, ..., -2.7245, 0.3207, 1.5517]],
[[-1.3155, -2.3146, 0.8112, ..., 1.7799, -0.2109, 4.8358]],
[[-3.9836, -0.7102, 1.4045, ..., -2.2827, 5.0350, 8.0413]]],
grad_fn=<TransposeBackward0>)
generating embeddings from facebook-XLM model..
{'en': 'hello how are you?'}
7
torch.Size([7, 1, 1024])
torch.Size([1, 1024])
tensor([[[-3.6706e+00, -5.1693e+00, 2.3415e+00, ..., -3.3566e+00,
2.2613e+00, 1.2468e+01]],
[[-5.1743e+00, -5.0928e+00, 1.0318e-02, ..., -5.8567e+00,
-3.4373e+00, 6.0835e+00]],
[[-2.3680e+00, -1.0124e+01, 3.8484e+00, ..., -2.8918e+00,
-8.9933e+00, -2.7259e+00]],
...,
[[-7.4168e+00, -3.6042e+00, 3.5969e+00, ..., 3.8602e+00,
-6.1241e-01, 5.5241e-01]],
[[-1.0793e+00, -4.0387e+00, 5.8260e+00, ..., 3.7948e+00,
2.2968e+00, -1.2407e+01]],
[[-3.8262e+00, -5.0583e+00, 5.7023e+00, ..., -4.9191e-01,
4.4571e+00, -2.0888e+00]]], grad_fn=<TransposeBackward0>)
Traceback (most recent call last):
  File "bert_score_test.py", line 13, in <module>
    P, R, F1 = score(cands, refs, cands_lang, refs_lang, bert="facebook-XLM", verbose=True, no_idf=no_idf)
  File "/src/bert_score/score.py", line 65, in score
    verbose=verbose, device=device, batch_size=batch_size)
  File "/src/bert_score/utils.py", line 192, in bert_cos_score_idf
    P, R, F1 = greedy_cos_idf(*ref_stats, *hyp_stats)
TypeError: greedy_cos_idf() takes 8 positional arguments but 15 were given
I am printing the size of the tensors to compare with your implementation. Is this error related to their shape?
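(A note on the numbers: `greedy_cos_idf` expects two 4-element stats tuples, eight arguments in total; unpacking raw embedding tensors of shapes [8, 1, 1024] and [7, 1, 1024] with `*` instead yields 8 + 7 = 15 arguments, which matches the error above. For context, the greedy matching it implements is, in essence, the following; a simplified, self-contained sketch that ignores idf weighting and batching:)

```python
import torch
import torch.nn.functional as F

def greedy_match(ref_emb, hyp_emb):
    """Core BERTScore matching. ref_emb and hyp_emb are
    [num_tokens, dim] tensors of contextual token embeddings."""
    ref = F.normalize(ref_emb, dim=-1)
    hyp = F.normalize(hyp_emb, dim=-1)
    sim = hyp @ ref.t()                    # pairwise cosine similarities
    precision = sim.max(dim=1)[0].mean()   # each hyp token matched to its best ref token
    recall = sim.max(dim=0)[0].mean()      # each ref token matched to its best hyp token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```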
By running `bert_score_test.py` using `bert-base-multilingual-cased` I get this:
The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
calculating scores...
0%| | 0/1 [00:00<?, ?it/s]<class 'tuple'>
4
torch.Size([1, 7, 768])
(tensor([[[-0.0576, -0.0147, 0.0266, ..., 0.8648, 1.4775, -0.6607],
[ 0.0755, 0.0690, -0.3626, ..., 0.3833, 1.1005, 0.4550],
[ 0.1177, 0.3928, 0.2649, ..., -0.5387, 0.6300, 0.0785],
...,
[ 0.2860, 0.2741, 0.0339, ..., 0.5458, 0.6054, 0.3276],
[ 0.4984, 0.4997, 0.1665, ..., -0.2474, 0.7287, -0.1994],
[ 0.2426, -0.3561, 0.9417, ..., 0.2333, 1.1731, -0.5414]]]), tensor([7]), tensor([[1, 1, 1, 1, 1, 1, 1]]), tensor([[0., 1., 1., 1., 1., 1., 0.]]))
<class 'tuple'>
4
torch.Size([1, 8, 768])
(tensor([[[ 0.1399, 0.0636, -0.3477, ..., 1.2948, 1.3497, -0.8473],
[ 0.5074, 0.4649, 0.0511, ..., 1.7759, 1.0347, -0.7468],
[ 0.6105, 0.7187, 0.1068, ..., 0.8703, 0.7290, -0.4718],
...,
[ 0.0584, 0.8752, 0.4854, ..., 0.8477, -0.3838, -0.3481],
[ 0.7315, 0.2678, 0.0808, ..., -0.2716, 0.4328, -0.6448],
[ 0.6694, -0.4003, 0.9021, ..., 0.4409, 0.8974, -0.7192]]]), tensor([8]), tensor([[1, 1, 1, 1, 1, 1, 1, 1]]), tensor([[0., 1., 1., 1., 1., 1., 1., 0.]]))
100%|############################################################################################################################| 1/1 [00:01<00:00, 1.52s/it]
done in 1.57 seconds
['hello how are you?']
['hola como estas?']
P: tensor([0.7123])
R: tensor([0.7264])
F1: tensor([0.7193])
So here instead, `ref_stats` and `hyp_stats` are tuples.
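(Judging from the printout, each stats tuple has the form `(embeddings, lengths, attention_mask, idf_weights)`. A hedged sketch of assembling one for a single sentence from an external encoder; the helper below is illustrative, not part of the library:)

```python
import torch

def make_stats(emb):
    """Build a 4-element stats tuple for one sentence.
    emb: [1, num_tokens, dim] contextual embeddings from any encoder.
    Without idf, real tokens are weighted 1.0 and the boundary
    (special) tokens 0.0, matching the printout above."""
    n = emb.size(1)
    lengths = torch.tensor([n])
    mask = torch.ones(1, n, dtype=torch.long)
    idf = torch.ones(1, n)
    idf[0, 0] = idf[0, -1] = 0.0  # zero out the special tokens
    return emb, lengths, mask, idf
```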
The code for XLM embedding generation is here.
Do you have any tips on this? Thanks a lot for your help!
I think you gave the wrong number of arguments to `greedy_cos_idf`.
I will add documentation to `utils.py` by tonight. Hang on.
Hi @Tiiiger, thanks a lot for the docs, they helped a lot.
I managed to make it work, but I am still missing something.
I am struggling a bit with how to correctly compute the `idf_dict` from https://github.com/facebookresearch/XLM.
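(For reference, BERTScore's idf weights are plain smoothed inverse document frequencies over the reference corpus. A minimal sketch, assuming a hypothetical `tokenize` function that maps a sentence to its tokens:)

```python
from collections import Counter
from math import log

def get_idf_dict(refs, tokenize):
    """idf(w) = log((N + 1) / (df(w) + 1)), where df(w) counts
    the reference sentences in which token w appears."""
    num_docs = len(refs)
    df = Counter()
    for ref in refs:
        df.update(set(tokenize(ref)))  # count each token once per sentence
    return {w: log((num_docs + 1) / (c + 1)) for w, c in df.items()}
```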
As of now I have these very bad results:
['hello how are you?']
['hola como estas?']
XLM:
P: tensor([0.6018])
R: tensor([0.6582])
F1: tensor([0.6287])
bert-base-multilingual-cased:
P: tensor([0.7123])
R: tensor([0.7264])
F1: tensor([0.7193])
It seems that `bert-base-multilingual-cased` is performing much better.
Hi @Tiiiger,
I had to modify something in the XLM vocabulary and use the BPE codes as the tokenizer. You can see some more changes here. The code still does not work for more than one reference phrase at a time, though.
I would like to ask you a couple of questions about things I am missing:
- Every time I launch the script with exactly the same two phrases, the results change slightly. What could cause this strange behaviour?
- Could you please clarify how to handle the `no_idf` parameter? For now it is: `no_idf = True if len(refs) == 1 else False`
Thanks a lot for your help!
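(On the first question: if identical inputs yield slightly different embeddings on every run, the usual culprit is dropout left active, i.e. the encoder was never switched to evaluation mode. A hedged sketch, where `model` is the loaded XLM encoder and `embed` a hypothetical helper:)

```python
import torch

model.eval()           # disable dropout so repeated runs are deterministic
with torch.no_grad():  # gradients are not needed for scoring
    embeddings = embed(model, sentences)  # `embed` and `sentences` are placeholders
```

(On the second: with a single reference sentence the document frequencies are degenerate, since every token occurs in the one "document" and all observed idf values collapse to the same number, so disabling idf when `len(refs) == 1` is reasonable.)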