Just a heads up: the absolute scores may be less meaningful here, because different models can produce scores in different ranges. Ideally, you would show that the score correlates with human judgment, but unfortunately I don't know of any cross-lingual dataset for that.
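(For reference, such a validation usually means computing segment-level correlation between the metric and human ratings on the same pairs. A minimal sketch, using the current `bert_score` API and made-up human ratings; the data here is purely illustrative:)

```python
from scipy.stats import pearsonr
from bert_score import score

cands = ['hello how are you?', 'the cat sat on the mat', 'a small house']
refs = ['hi, how are you doing?', 'a cat was sitting on the mat', 'a tiny home']
human_scores = [4.5, 3.5, 4.0]  # hypothetical human adequacy ratings, aligned with the pairs

P, R, F1 = score(cands, refs, lang='en')
r, p = pearsonr(F1.tolist(), human_scores)  # segment-level correlation with humans
print(f'Pearson r = {r:.3f} (p = {p:.3g})')
```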
As I understand it, you don't have any more implementation questions, so I am closing this.
I am happy to chat about the potential research opportunities for cross-lingual scores. Feel free to continue the conversation under this issue, or contact us directly via email if you want to keep it private.
We just saw https://github.com/facebookresearch/XLM, an implementation of a cross-lingual language model based on BERT.
It seems that the XNLI-15 model could be a nice first solution:
XNLI-15 is the model used in the paper for XNLI fine-tuning.
It handles English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu.
It is trained with the MLM and the TLM objectives.
For this model we used a different preprocessing than for the MT models (such as lowercasing and accent removal).
The repository includes an example of how it works.
Could you consider implementing this in your library?
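(A minimal sketch of the preprocessing mentioned above, i.e. lowercasing plus accent removal via Unicode decomposition; the exact XLM pipeline may differ:)

```python
import unicodedata

def preprocess(text):
    """Lowercase and strip accents, e.g. 'Como estás?' -> 'como estas?'."""
    text = text.lower()
    # NFD splits each accented character into base character + combining mark
    decomposed = unicodedata.normalize('NFD', text)
    # drop the combining marks (Unicode category 'Mn')
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
```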
Thank you @shoegazerstella for letting us know. We are definitely going to look into it but it may take some time before we get back to you.
If this is really important to your research, I encourage you to fork the repo and start implementing it. The general backend of BERTScore is at https://github.com/Tiiiger/bert_score/blob/master/bert_score/utils.py. Please let me know if you have any questions.
Hope the added docs can help you.
Hello,
We also conjecture that this is possible, although we have not done a proper study of this hypothesis.
Hi @Tiiiger,
What would you suggest if we want to verify whether this hypothesis is valid? Is it possible to use an existing model, or would a new BERT model need to be trained?
I was trying this:
from bert_score import score

cands = ['hello how are you?', 'cat', 'house']
refs = ['ciao come stai?', 'gatto', 'topo']  # 'topo' means 'mouse', so the last pair is a deliberate mismatch
P, R, F1 = score(cands, refs, bert="bert-base-multilingual-cased", verbose=True)
P: tensor([0.6788, 0.6708, 0.7756])
R: tensor([0.7002, 0.6582, 0.7756])
F1: tensor([0.6893, 0.6644, 0.7756])
And the similarity matrix plot is this: (plot not shown)
Clearly this is not working: the last words in the two lists, house/topo, are not the same word, yet they get the highest similarity score.
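(The similarity matrix can be reproduced with the library's `plot_example` helper; the keyword names below are from the current API and may differ in older releases, so treat this as a sketch:)

```python
from bert_score import plot_example

# Visualize token-level cosine similarities for one candidate/reference pair.
plot_example('house', 'topo', model_type='bert-base-multilingual-cased')
```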
I am trying to implement the solution discussed. You can find the code here.
Apologies if this is not the most elegant solution, but it was the fastest for me to test today.
So I am running `bert_score_test.py`, but I get stuck after the embedding generation; the prints below refer to `ref_stats` and `hyp_stats` in `bert_score/utils.py`:
calculating scores...
loading facebook-XLM model..
Loading vocabulary from XLM/models/vocab_xnli_15.txt ...
Read 4622450944 words (121332 unique) from vocabulary file.
Loading codes from XLM/models/codes_xnli_15.txt ...
Read 80000 codes from the codes file.
0%| | 0/1 [00:00<?, ?it/s]generating embeddings from facebook-XLM model..
{'es': 'hola como estas?'}
8
torch.Size([8, 1, 1024])
torch.Size([1, 1024])
tensor([[[-0.0235, -1.1157, 5.5236, ..., -1.7445, 4.6693, -5.4893]],
[[-4.8977, -5.8174, -0.0425, ..., -4.0513, 1.5466, 1.0361]],
[[-3.0278, -2.7101, -6.6004, ..., -2.3234, 2.0516, 0.8349]],
...,
[[-3.4236, -3.7358, 1.5622, ..., -2.7245, 0.3207, 1.5517]],
[[-1.3155, -2.3146, 0.8112, ..., 1.7799, -0.2109, 4.8358]],
[[-3.9836, -0.7102, 1.4045, ..., -2.2827, 5.0350, 8.0413]]],
grad_fn=<TransposeBackward0>)
generating embeddings from facebook-XLM model..
{'en': 'hello how are you?'}
7
torch.Size([7, 1, 1024])
torch.Size([1, 1024])
tensor([[[-3.6706e+00, -5.1693e+00, 2.3415e+00, ..., -3.3566e+00,
2.2613e+00, 1.2468e+01]],
[[-5.1743e+00, -5.0928e+00, 1.0318e-02, ..., -5.8567e+00,
-3.4373e+00, 6.0835e+00]],
[[-2.3680e+00, -1.0124e+01, 3.8484e+00, ..., -2.8918e+00,
-8.9933e+00, -2.7259e+00]],
...,
[[-7.4168e+00, -3.6042e+00, 3.5969e+00, ..., 3.8602e+00,
-6.1241e-01, 5.5241e-01]],
[[-1.0793e+00, -4.0387e+00, 5.8260e+00, ..., 3.7948e+00,
2.2968e+00, -1.2407e+01]],
[[-3.8262e+00, -5.0583e+00, 5.7023e+00, ..., -4.9191e-01,
4.4571e+00, -2.0888e+00]]], grad_fn=<TransposeBackward0>)
Traceback (most recent call last):
  File "bert_score_test.py", line 13, in <module>
    P, R, F1 = score(cands, refs, cands_lang, refs_lang, bert="facebook-XLM", verbose=True, no_idf=no_idf)
  File "/src/bert_score/score.py", line 65, in score
    verbose=verbose, device=device, batch_size=batch_size)
  File "/src/bert_score/utils.py", line 192, in bert_cos_score_idf
    P, R, F1 = greedy_cos_idf(*ref_stats, *hyp_stats)
TypeError: greedy_cos_idf() takes 8 positional arguments but 15 were given
I am printing the size of the tensors to compare with your implementation. Is this error related to their shape?
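(A note on the numbers: `greedy_cos_idf` expects two 4-element stats tuples, eight arguments in total; unpacking raw embedding tensors of shapes [8, 1, 1024] and [7, 1, 1024] with `*` instead yields 8 + 7 = 15 arguments, which matches the error above. For context, the greedy matching it implements is, in essence, the following; a simplified, self-contained sketch that ignores idf weighting and batching:)

```python
import torch
import torch.nn.functional as F

def greedy_match(ref_emb, hyp_emb):
    """Core BERTScore matching. ref_emb and hyp_emb are
    [num_tokens, dim] tensors of contextual token embeddings."""
    ref = F.normalize(ref_emb, dim=-1)
    hyp = F.normalize(hyp_emb, dim=-1)
    sim = hyp @ ref.t()                    # pairwise cosine similarities
    precision = sim.max(dim=1)[0].mean()   # each hyp token matched to its best ref token
    recall = sim.max(dim=0)[0].mean()      # each ref token matched to its best hyp token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```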
By running `bert_score_test.py` using `bert-base-multilingual-cased` I get this:
The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
calculating scores...
0%| | 0/1 [00:00<?, ?it/s]<class 'tuple'>
4
torch.Size([1, 7, 768])
(tensor([[[-0.0576, -0.0147, 0.0266, ..., 0.8648, 1.4775, -0.6607],
[ 0.0755, 0.0690, -0.3626, ..., 0.3833, 1.1005, 0.4550],
[ 0.1177, 0.3928, 0.2649, ..., -0.5387, 0.6300, 0.0785],
...,
[ 0.2860, 0.2741, 0.0339, ..., 0.5458, 0.6054, 0.3276],
[ 0.4984, 0.4997, 0.1665, ..., -0.2474, 0.7287, -0.1994],
[ 0.2426, -0.3561, 0.9417, ..., 0.2333, 1.1731, -0.5414]]]), tensor([7]), tensor([[1, 1, 1, 1, 1, 1, 1]]), tensor([[0., 1., 1., 1., 1., 1., 0.]]))
<class 'tuple'>
4
torch.Size([1, 8, 768])
(tensor([[[ 0.1399, 0.0636, -0.3477, ..., 1.2948, 1.3497, -0.8473],
[ 0.5074, 0.4649, 0.0511, ..., 1.7759, 1.0347, -0.7468],
[ 0.6105, 0.7187, 0.1068, ..., 0.8703, 0.7290, -0.4718],
...,
[ 0.0584, 0.8752, 0.4854, ..., 0.8477, -0.3838, -0.3481],
[ 0.7315, 0.2678, 0.0808, ..., -0.2716, 0.4328, -0.6448],
[ 0.6694, -0.4003, 0.9021, ..., 0.4409, 0.8974, -0.7192]]]), tensor([8]), tensor([[1, 1, 1, 1, 1, 1, 1, 1]]), tensor([[0., 1., 1., 1., 1., 1., 1., 0.]]))
100%|############################################################################################################################| 1/1 [00:01<00:00, 1.52s/it]
done in 1.57 seconds
['hello how are you?']
['hola como estas?']
P: tensor([0.7123])
R: tensor([0.7264])
F1: tensor([0.7193])
So here instead, `ref_stats` and `hyp_stats` are tuples.
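(Judging from the printout, each stats tuple has the form `(embeddings, lengths, attention_mask, idf_weights)`. A hedged sketch of assembling one for a single sentence from an external encoder; the helper below is illustrative, not part of the library:)

```python
import torch

def make_stats(emb):
    """Build a 4-element stats tuple for one sentence.
    emb: [1, num_tokens, dim] contextual embeddings from any encoder.
    Without idf, real tokens are weighted 1.0 and the boundary
    (special) tokens 0.0, matching the printout above."""
    n = emb.size(1)
    lengths = torch.tensor([n])
    mask = torch.ones(1, n, dtype=torch.long)
    idf = torch.ones(1, n)
    idf[0, 0] = idf[0, -1] = 0.0  # zero out the special tokens
    return emb, lengths, mask, idf
```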
The code for XLM embedding generation is here.
Do you have any tips on this? Thanks a lot for your help!
I think you gave the wrong number of arguments to `greedy_cos_idf`.
I will add documentation to `utils.py` by tonight. Hang on.
Hi @Tiiiger, thanks a lot for the docs, they helped a lot.
I managed to make it work, but I am still missing something.
I am struggling a bit with how to correctly compute the `idf_dict` from https://github.com/facebookresearch/XLM.
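(For reference, BERTScore's idf weights are plain smoothed inverse document frequencies over the reference corpus. A minimal sketch, assuming a hypothetical `tokenize` function that maps a sentence to its tokens:)

```python
from collections import Counter
from math import log

def get_idf_dict(refs, tokenize):
    """idf(w) = log((N + 1) / (df(w) + 1)), where df(w) counts
    the reference sentences in which token w appears."""
    num_docs = len(refs)
    df = Counter()
    for ref in refs:
        df.update(set(tokenize(ref)))  # count each token once per sentence
    return {w: log((num_docs + 1) / (c + 1)) for w, c in df.items()}
```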
As of now I have these very bad results:
['hello how are you?']
['hola como estas?']
XLM:
P: tensor([0.6018])
R: tensor([0.6582])
F1: tensor([0.6287])
bert-base-multilingual-cased:
P: tensor([0.7123])
R: tensor([0.7264])
F1: tensor([0.7193])
It seems that `bert-base-multilingual-cased` is performing much better.
Hi @Tiiiger,
I had to modify something in the XLM vocabulary and use the BPE codes as the tokenizer. You can see some more changes here. The code still does not work for more than one reference phrase at a time, though.
I would like to ask you a couple of questions about things I am missing:
- Every time I launch the script with exactly the same two phrases, the results change slightly. What could cause this strange behaviour?
- Could you please clarify how to handle the `no_idf` parameter? For now it is: `no_idf = True if len(refs) == 1 else False`
Thanks a lot for your help!
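(On the first question: if identical inputs yield slightly different embeddings on every run, the usual culprit is dropout left active, i.e. the encoder was never switched to evaluation mode. A hedged sketch, where `model` is the loaded XLM encoder and `embed` a hypothetical helper:)

```python
import torch

model.eval()           # disable dropout so repeated runs are deterministic
with torch.no_grad():  # gradients are not needed for scoring
    embeddings = embed(model, sentences)  # `embed` and `sentences` are placeholders
```

(On the second: with a single reference sentence the document frequencies are degenerate, since every token occurs in the one "document" and all observed idf values collapse to the same number, so disabling idf when `len(refs) == 1` is reasonable.)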