Comments (2)
Hi @g32M7fT6b8Y,
We believe this is mostly an issue of usage, not a weakness in the method itself. We have indeed found that BERTScore computed with deep contextual embedding models can sometimes have a small numerical range (also pointed out in #20). However, this does not mean that BERTScore cannot distinguish bad candidates (bad responses in your case) from good ones: if we rank the candidates, the good candidates score higher than the bad ones. On this note, we also refer you to the correlation studies in our paper.
We also don’t want to simply ignore this “numerical range” problem, because it hinders the readability of our method’s output. After several rounds of consideration, here is what we propose:
We take a large monolingual corpus and randomly pair sentences to form candidate-reference pairs. Because the candidate and reference in each pair are irrelevant to each other, the average BERTScore over these pairs should serve as an empirical lower bound. We propose to use this lower bound to rescale BERTScore: subtract the lower bound from a raw BERTScore and divide the difference by (1 - lower bound).
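The rescaling described above is a simple linear map. A minimal sketch (the function name is ours; the 0.83 baseline is the RoBERTa-Large figure quoted in the numbers below):

```python
def rescale_with_baseline(score: float, baseline: float) -> float:
    """Linearly rescale a raw BERTScore against an empirical lower bound.

    A score equal to the baseline (the average score of random, irrelevant
    candidate-reference pairs) maps to 0.0; a perfect score of 1.0 stays 1.0.
    """
    return (score - baseline) / (1.0 - baseline)

# Baseline for RoBERTa-Large on the WMT17 English news crawl, as quoted below.
baseline = 0.83

# Average raw F1 on WMT18 De-En, as quoted below.
print(rescale_with_baseline(0.9311, baseline))
```

Note that 0.83 is a rounded figure: applied to the 0.9311 average it gives roughly 0.595, slightly above the 0.5758 quoted below, which was presumably computed with the unrounded baseline.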
For some numbers:
On the WMT17 news crawl English corpus, the lower bound for BERTScore computed with RoBERTa-Large is 0.83. With this rescaling, the average BERTScore on the WMT18 De-En translation evaluation dataset drops from 0.9311 to 0.5758. For a concrete example, consider the case mentioned in #20: before rescaling, the scores are compressed into a narrow high band; after rescaling, the same scores spread over a much wider, more readable range. (The score-distribution plots attached to the original comment are omitted here.)
Note that this modification only changes the range of BERTScore and does not affect its correlation with human judgment. We are currently adding software support in this repo; stay tuned, and we will push this change in the new version soon.
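A quick sanity check (ours, not from the repo) of why the rescaling cannot change rankings, and hence cannot change rank-based correlation with human judgment: the map s -> (s - b) / (1 - b) is strictly increasing for any baseline b < 1.

```python
raw_scores = [0.91, 0.86, 0.95, 0.88]  # hypothetical raw BERTScores
baseline = 0.83

rescaled = [(s - baseline) / (1.0 - baseline) for s in raw_scores]

# The candidate ordering is identical before and after rescaling, because
# a strictly increasing linear map preserves every pairwise comparison.
order_raw = sorted(range(len(raw_scores)), key=lambda i: raw_scores[i])
order_rescaled = sorted(range(len(rescaled)), key=lambda i: rescaled[i])
assert order_raw == order_rescaled
print(order_raw)  # candidate indices from worst to best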
I am closing this issue but feel free to continue the thread here.
from bert_score.
Thank you for your response.
I think it may be an appropriate way to alleviate this issue.
Cannot wait to try the new version of BERTScore.