
Comments (10)

Tiiiger commented on July 18, 2024

Just a heads up: the absolute scores may be less meaningful because scores from different models can have different ranges. Ideally, you would show that the score correlates with human judgment, but unfortunately I don't know of any dataset for that.
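Validating a metric against human judgment usually comes down to computing a correlation between the metric's scores and human ratings over the same sentence pairs. A minimal sketch of that check in plain Python (the ratings below are made-up placeholder numbers, not real annotation data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: metric F1 scores vs. human adequacy ratings
# for the same candidate/reference pairs.
metric_scores = [0.69, 0.66, 0.78, 0.52]
human_ratings = [3.5, 3.0, 4.2, 2.1]
correlation = pearson(metric_scores, human_ratings)
```

A high positive correlation would be evidence that the cross-lingual score tracks human judgment, regardless of the score's absolute range.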

As I understand it, you don't have any more implementation questions, so I am closing this.

I am happy to chat about the potential research opportunities for the cross-lingual scores. Feel free to continue the conversation under this issue or contact us directly through emails if you want to keep it private.

from bert_score.

shoegazerstella commented on July 18, 2024

We just saw this https://github.com/facebookresearch/XLM implementation of a cross-lingual Language Model based on BERT.
It seems that the XNLI-15 model could be a nice first solution:

XNLI-15 is the model used in the paper for XNLI fine-tuning.
It handles English, French, Spanish, German, Greek, Bulgarian, 
Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu.
It is trained with the MLM and the TLM objectives. 
For this model we used a different preprocessing than for the MT models (such as lowercasing and accents removal).

There is an example of how it works.

Could you consider implementing this in your library?


Tiiiger commented on July 18, 2024

Thank you @shoegazerstella for letting us know. We are definitely going to look into it but it may take some time before we get back to you.

If this is really important to your research, I encourage you to fork the repo and start implementing it. The general backend of BERTScore is at https://github.com/Tiiiger/bert_score/blob/master/bert_score/utils.py. Please let me know if you have any questions.
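For anyone forking: the heart of that backend is a greedy cosine-matching step — each candidate token embedding is matched to its most similar reference token embedding (for precision) and vice versa (for recall), and the two are combined into F1. A minimal NumPy sketch of the idea, ignoring idf weighting; this is an illustration, not the actual `utils.py` code:

```python
import numpy as np

def greedy_match_score(ref_emb, cand_emb):
    """Greedy-matching precision/recall/F1 from token embeddings.

    ref_emb:  (num_ref_tokens, dim) reference token embeddings
    cand_emb: (num_cand_tokens, dim) candidate token embeddings
    """
    # L2-normalize rows so dot products are cosine similarities.
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                   # (num_cand, num_ref) similarities
    precision = sim.max(axis=1).mean()   # best reference match per candidate token
    recall = sim.max(axis=0).mean()      # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example with random "embeddings": identical inputs give a perfect score.
rng = np.random.default_rng(0)
ref = rng.normal(size=(5, 8))
P, R, F1 = greedy_match_score(ref, ref)
```

Swapping in a different encoder (e.g. XLM) mostly means producing the `ref_emb`/`cand_emb` matrices from that model's tokenizer and hidden states; this matching step stays the same.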


Tiiiger commented on July 18, 2024

Hope the added docs can help you.


Tiiiger commented on July 18, 2024

Hello,

We also conjecture that this is possible, although we have not done a proper study of this hypothesis.


shoegazerstella commented on July 18, 2024

Hi @Tiiiger,
What would you suggest if we want to verify whether this hypothesis is valid? Is it possible to use an existing model, or is there a need to train a new BERT model?

I was trying this:

cands = ['hello how are you?', 'cat', 'house']
refs = ['ciao come stai?', 'gatto', 'topo']

P, R, F1 = score(cands, refs, bert="bert-base-multilingual-cased", verbose=True)

P: tensor([0.6788, 0.6708, 0.7756])
R: tensor([0.7002, 0.6582, 0.7756])
F1: tensor([0.6893, 0.6644, 0.7756])

And the confusion matrix plot is this:
[screenshot: pairwise token similarity matrix, 2019-05-02]

Clearly this is not working: the last words in the two lists, house and topo (Italian for mouse), are not the same, yet they get the highest similarity score.


shoegazerstella commented on July 18, 2024

I am trying to implement the solution discussed. You can find the code here.

Apologies if this is not the most elegant solution, but it was the fastest for me to test today.

So I am running bert_score_test.py but get stuck after the embedding generation; the prints below refer to ref_stats and hyp_stats in bert_score/utils.py:

calculating scores...
loading facebook-XLM model..
Loading vocabulary from XLM/models/vocab_xnli_15.txt ...
Read 4622450944 words (121332 unique) from vocabulary file.
Loading codes from XLM/models/codes_xnli_15.txt ...
Read 80000 codes from the codes file.
  0%|                                                                                                                                    | 0/1 [00:00<?, ?it/s]generating embeddings from facebook-XLM model..
{'es': 'hola como estas?'}
8
torch.Size([8, 1, 1024])
torch.Size([1, 1024])
tensor([[[-0.0235, -1.1157,  5.5236,  ..., -1.7445,  4.6693, -5.4893]],

        [[-4.8977, -5.8174, -0.0425,  ..., -4.0513,  1.5466,  1.0361]],

        [[-3.0278, -2.7101, -6.6004,  ..., -2.3234,  2.0516,  0.8349]],

        ...,

        [[-3.4236, -3.7358,  1.5622,  ..., -2.7245,  0.3207,  1.5517]],

        [[-1.3155, -2.3146,  0.8112,  ...,  1.7799, -0.2109,  4.8358]],

        [[-3.9836, -0.7102,  1.4045,  ..., -2.2827,  5.0350,  8.0413]]],
       grad_fn=<TransposeBackward0>)
generating embeddings from facebook-XLM model..
{'en': 'hello how are you?'}
7
torch.Size([7, 1, 1024])
torch.Size([1, 1024])
tensor([[[-3.6706e+00, -5.1693e+00,  2.3415e+00,  ..., -3.3566e+00,
           2.2613e+00,  1.2468e+01]],

        [[-5.1743e+00, -5.0928e+00,  1.0318e-02,  ..., -5.8567e+00,
          -3.4373e+00,  6.0835e+00]],

        [[-2.3680e+00, -1.0124e+01,  3.8484e+00,  ..., -2.8918e+00,
          -8.9933e+00, -2.7259e+00]],

        ...,

        [[-7.4168e+00, -3.6042e+00,  3.5969e+00,  ...,  3.8602e+00,
          -6.1241e-01,  5.5241e-01]],

        [[-1.0793e+00, -4.0387e+00,  5.8260e+00,  ...,  3.7948e+00,
           2.2968e+00, -1.2407e+01]],

        [[-3.8262e+00, -5.0583e+00,  5.7023e+00,  ..., -4.9191e-01,
           4.4571e+00, -2.0888e+00]]], grad_fn=<TransposeBackward0>)
Traceback (most recent call last):
  File "bert_score_test.py", line 13, in <module>
    P, R, F1 = score(cands, refs, cands_lang, refs_lang, bert="facebook-XLM", verbose=True, no_idf=no_idf) 
  File "/src/bert_score/score.py", line 65, in score
    verbose=verbose, device=device, batch_size=batch_size)
  File "/src/bert_score/utils.py", line 192, in bert_cos_score_idf
    P, R, F1 = greedy_cos_idf(*ref_stats, *hyp_stats)
TypeError: greedy_cos_idf() takes 8 positional arguments but 15 were given

I am printing the size of the tensors to compare with your implementation. Is this error related to their shape?
By running bert_score_test.py using bert-base-multilingual-cased I get this:

The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
calculating scores...
  0%|                                                                                                                                    | 0/1 [00:00<?, ?it/s]<class 'tuple'>
4
torch.Size([1, 7, 768])
(tensor([[[-0.0576, -0.0147,  0.0266,  ...,  0.8648,  1.4775, -0.6607],
         [ 0.0755,  0.0690, -0.3626,  ...,  0.3833,  1.1005,  0.4550],
         [ 0.1177,  0.3928,  0.2649,  ..., -0.5387,  0.6300,  0.0785],
         ...,
         [ 0.2860,  0.2741,  0.0339,  ...,  0.5458,  0.6054,  0.3276],
         [ 0.4984,  0.4997,  0.1665,  ..., -0.2474,  0.7287, -0.1994],
         [ 0.2426, -0.3561,  0.9417,  ...,  0.2333,  1.1731, -0.5414]]]), tensor([7]), tensor([[1, 1, 1, 1, 1, 1, 1]]), tensor([[0., 1., 1., 1., 1., 1., 0.]]))
<class 'tuple'>
4
torch.Size([1, 8, 768])
(tensor([[[ 0.1399,  0.0636, -0.3477,  ...,  1.2948,  1.3497, -0.8473],
         [ 0.5074,  0.4649,  0.0511,  ...,  1.7759,  1.0347, -0.7468],
         [ 0.6105,  0.7187,  0.1068,  ...,  0.8703,  0.7290, -0.4718],
         ...,
         [ 0.0584,  0.8752,  0.4854,  ...,  0.8477, -0.3838, -0.3481],
         [ 0.7315,  0.2678,  0.0808,  ..., -0.2716,  0.4328, -0.6448],
         [ 0.6694, -0.4003,  0.9021,  ...,  0.4409,  0.8974, -0.7192]]]), tensor([8]), tensor([[1, 1, 1, 1, 1, 1, 1, 1]]), tensor([[0., 1., 1., 1., 1., 1., 1., 0.]]))
100%|############################################################################################################################| 1/1 [00:01<00:00,  1.52s/it]
done in 1.57 seconds
['hello how are you?']
['hola como estas?']
P: tensor([0.7123])
R: tensor([0.7264])
F1: tensor([0.7193])

So here, instead, ref_stats and hyp_stats are tuples.
The code for XLM embedding generation is here.
Do you have any tips on this? Thanks a lot for your help!


Tiiiger commented on July 18, 2024

I think you gave the wrong number of arguments to greedy_cos_idf.
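To illustrate the argument count: `greedy_cos_idf(*ref_stats, *hyp_stats)` star-unpacks each stats object into separate positional arguments, so the function sees `len(ref_stats) + len(hyp_stats)` arguments. With two 4-element tuples that is the expected 8; one plausible cause of the "15 were given" error is passing raw tensors instead, since unpacking iterates the first dimension — and a 7-token and an 8-token embedding tensor would unpack into exactly 7 + 8 = 15 arguments, matching the sentence lengths in the log above. A small sketch of the mechanics (the function and the stand-in data are illustrative, not the real API):

```python
def takes_eight(a, b, c, d, e, f, g, h):
    # Stand-in for a function expecting two 4-element stats tuples.
    return "ok"

# Stand-ins for per-token rows of a 7-token and an 8-token sentence; a torch
# tensor of shape (7, 1, 1024) would unpack the same way, row by row.
ref_stats = [f"ref_row_{i}" for i in range(7)]
hyp_stats = [f"hyp_row_{i}" for i in range(8)]

# Star-unpacking iterates the object, so this call passes 7 + 8 = 15 arguments:
try:
    takes_eight(*ref_stats, *hyp_stats)
except TypeError as err:
    print(err)  # takes_eight() takes 8 positional arguments but 15 were given

# Two 4-element tuples unpack into the expected 8 arguments:
result = takes_eight(*(0, 1, 2, 3), *(4, 5, 6, 7))  # returns "ok"
```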

I will add documentation to utils.py by tonight. Hang on.


shoegazerstella commented on July 18, 2024

Hi @Tiiiger, thanks a lot for the docs, they helped a lot.
I managed to make it work, but I am still missing something.
I am struggling a bit with how to correctly compute the idf_dict from https://github.com/facebookresearch/XLM.

As of now I have these very bad results:

['hello how are you?']
['hola como estas?']

XLM:
P: tensor([0.6018])
R: tensor([0.6582])
F1: tensor([0.6287])

bert-base-multilingual-cased:
P: tensor([0.7123])
R: tensor([0.7264])
F1: tensor([0.7193])

It seems that bert-base-multilingual-cased is performing much better.


shoegazerstella commented on July 18, 2024

Hi @Tiiiger,

I had to modify something in the XLM vocabulary and use BPE as the tokenizer. You can see some more changes here. The code still does not work for more than one reference phrase at a time, though.

I would like to ask you a couple of questions on some things I am missing:

  • Every time I launch the script with exactly the same two phrases, the results change slightly. What could cause this strange behaviour?
  • Could you clarify how to handle the no_idf parameter? For now I have:
no_idf = True if len(refs) == 1 else False
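For context on that flag: idf weighting down-weights tokens that occur in many reference sentences, which is only informative when there are multiple references; with a single reference every token occurs once, so the weighting degenerates and disabling it (no_idf=True) is reasonable. A minimal sketch of a document-frequency-based idf dict; the +1 smoothing here is an assumption for illustration, not necessarily the exact formula bert_score uses:

```python
import math
from collections import Counter

def compute_idf(tokenized_refs):
    """idf weight per token, from document frequency over reference sentences."""
    num_docs = len(tokenized_refs)
    df = Counter()
    for tokens in tokenized_refs:
        df.update(set(tokens))  # count each token at most once per sentence
    # +1 smoothing (an assumption): tokens in every sentence get weight log(1) = 0.
    return {tok: math.log((num_docs + 1) / (cnt + 1)) for tok, cnt in df.items()}

refs = [["hola", "como", "estas", "?"],
        ["como", "te", "llamas", "?"]]
idf = compute_idf(refs)
# "como" and "?" appear in both sentences, so they get the lowest idf weight.
```

With `len(refs) == 1` every token's document frequency equals the corpus size, so all weights collapse to the same value — which is why the single-reference case is better served by uniform weights.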

Thanks a lot for your help!

