Comments (2)
Hi @g32M7fT6b8Y,
We believe this is mostly an issue of usage, not a weakness in the method itself. We have indeed found that BERTScore computed with deep contextual embedding models can sometimes have a small numerical range (also pointed out in #20). However, this does not mean that BERTScore cannot distinguish bad candidates (bad responses in your case) from good ones: if we rank the candidates, the good candidates score higher than the bad ones. On this note, we also refer you to the correlation studies in our paper.
We also don’t want to simply ignore this “numerical range” problem, because it hinders the readability of our method’s output. After several rounds of consideration, here is what we propose:
We take a large monolingual corpus and randomly pair sentences to form candidate-reference pairs. Because the candidate and reference in each pair are irrelevant to each other, the average BERTScore over these pairs should serve as an empirical lower bound. We propose to use this lower bound to rescale BERTScore: subtract the lower bound from a raw BERTScore and divide the difference by (1 - lower bound).
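The rescaling described above is a simple linear map. A minimal sketch (the function name is ours; the 0.83 baseline is the RoBERTa-Large figure quoted in the numbers below):

```python
def rescale_with_baseline(score: float, baseline: float) -> float:
    """Linearly rescale a raw BERTScore against an empirical lower bound.

    A score equal to the baseline (the average score of random, irrelevant
    candidate-reference pairs) maps to 0.0; a perfect score of 1.0 stays 1.0.
    """
    return (score - baseline) / (1.0 - baseline)

# Baseline for RoBERTa-Large on the WMT17 English news crawl, as quoted below.
baseline = 0.83

# Average raw F1 on WMT18 De-En, as quoted below.
print(rescale_with_baseline(0.9311, baseline))
```

Note that 0.83 is a rounded figure: applied to the 0.9311 average it gives roughly 0.595, slightly above the 0.5758 quoted below, which was presumably computed with the unrounded baseline.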
For some numbers:
On the WMT17 news crawl English corpus, the lower bound for BERTScore computed with RoBERTa-Large is 0.83. With this rescaling, the average BERTScore on the WMT18 De-En translation evaluation dataset drops from 0.9311 to 0.5758. For a concrete example, consider the case mentioned in #20: before rescaling, the scores are compressed into a narrow high band; after rescaling, the same scores spread over a much wider, more readable range. (The score-distribution plots attached to the original comment are omitted here.)
Note that this modification only changes the range of BERTScore and does not affect its correlation with human judgment. We are currently adding software support in this repo; stay tuned, and we will push this change in the new version soon.
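A quick sanity check (ours, not from the repo) of why the rescaling cannot change rankings, and hence cannot change rank-based correlation with human judgment: the map s -> (s - b) / (1 - b) is strictly increasing for any baseline b < 1.

```python
raw_scores = [0.91, 0.86, 0.95, 0.88]  # hypothetical raw BERTScores
baseline = 0.83

rescaled = [(s - baseline) / (1.0 - baseline) for s in raw_scores]

# The candidate ordering is identical before and after rescaling, because
# a strictly increasing linear map preserves every pairwise comparison.
order_raw = sorted(range(len(raw_scores)), key=lambda i: raw_scores[i])
order_rescaled = sorted(range(len(rescaled)), key=lambda i: rescaled[i])
assert order_raw == order_rescaled
print(order_raw)  # candidate indices from worst to best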
I am closing this issue but feel free to continue the thread here.
from bert_score.
Thank you for your response.
I think it may be an appropriate way to alleviate this issue.
Cannot wait to try the new version of BERTScore.