Benchmark tests? (comet issue, closed)

unbabel commented on May 9, 2024

Benchmark tests?

from comet.

Comments (6)

ricardorei commented on May 9, 2024

Can you share a few more details?

For example:

the exact command you are calling
a few samples (I don't need the entire test set, just the samples that produce different outputs)
the output that is being produced

Small differences on the same samples are possible (for example, 0.3456 vs 0.3462), but anything larger than that means something is wrong.
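As a rough illustration of the distinction above, here is a minimal sketch that treats two scores as matching when they differ by at most an absolute tolerance. The helper name `scores_agree` and the tolerance of 0.001 are assumptions chosen for illustration; they are not part of comet.

```python
import math


def scores_agree(a: float, b: float, tol: float = 1e-3) -> bool:
    """Return True when two segment scores differ by at most `tol` (absolute)."""
    # rel_tol=0.0 makes this a purely absolute comparison.
    return math.isclose(a, b, rel_tol=0.0, abs_tol=tol)


print(scores_agree(0.3456, 0.3462))  # difference of 0.0006
print(scores_agree(0.2642, 0.4263))  # difference of about 0.16
```

With this assumed tolerance, the first pair counts as the same result and the second pair does not.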

cfberger commented on May 9, 2024

Thanks for the quick response. The command I'm calling follows the examples in the README, for example:
comet-score -s SRCA.txt -t HTA1_MT_EN-US_-ZH-CN.txt --model wmt20-comet-qe-da
SRCA.txt is a plain text file in English. HTA1_MT_EN-US-_ZH-CN.txt is a translation in Chinese.
For the first segment (sentence) I get scores of 0.2642 and 0.4263 on my two machines, so they are fairly different and this is not a rounding issue.
This segment/sentence in SRCA is "From the factory floor to air-to-air combat, artificial intelligence will soon replace humans not only in jobs that involve basic, repetitive tasks but advanced analytical and decision-making skills."
The target text is "不久之后，人工智能将在许多工作中取代人类，从工厂车间到空对空作战，不仅基本和重复性的工作将由人工智能进行，需要高级分析和决策技能的工作亦是如此。" (I don't speak/read Chinese, so I don't have a clue what it says.)
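To make a single-segment case like this easy to rerun on several machines, one option is to write the two sentences into small files with exactly one segment per line. The sketch below is only an illustration: the helper `write_segments` and the file names `src.txt` and `mt.txt` are hypothetical, and the printed command simply reuses the flags and model name from the command above.

```python
import tempfile
from pathlib import Path


def write_segments(path: Path, segments: list[str]) -> None:
    """Write one segment per line, stripping stray whitespace and skipping empty segments."""
    cleaned = [s.strip() for s in segments if s.strip()]
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")


SRC = (
    "From the factory floor to air-to-air combat, artificial intelligence will soon "
    "replace humans not only in jobs that involve basic, repetitive tasks but "
    "advanced analytical and decision-making skills."
)
MT = (
    "不久之后，人工智能将在许多工作中取代人类，从工厂车间到空对空作战，"
    "不仅基本和重复性的工作将由人工智能进行，需要高级分析和决策技能的工作亦是如此。"
)

workdir = Path(tempfile.mkdtemp())  # hypothetical scratch directory
src_path = workdir / "src.txt"
mt_path = workdir / "mt.txt"
write_segments(src_path, [SRC + "\n"])  # the stray newline is stripped
write_segments(mt_path, [MT])

# Command to run by hand on each machine (flags as in the command above).
print(f"comet-score -s {src_path} -t {mt_path} --model wmt20-comet-qe-da")
```

Running the printed command on each machine with identical files removes input differences such as extra whitespace as a possible cause.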

ricardorei commented on May 9, 2024

The output on my computer was:
mt.txt Segment 0 score: 0.4283
mt.txt score: 0.4283

ricardorei commented on May 9, 2024

On a completely different machine, using a GPU, I get the same output:
mt.txt Segment 0 score: 0.4283
mt.txt score: 0.4283

ricardorei commented on May 9, 2024

I don't understand where the 0.2642 score comes from. I can't replicate it with your example.

cfberger commented on May 9, 2024

Thanks, yes, that's what I'm getting on my Mac, too. I get the other score (0.2642) on my Linux machine, where I had some error messages while trying to install comet because of package incompatibilities. I had to wipe those packages and reinstall. This leads me to think that some residual incompatibility was never resolved and is producing unreliable scores, which is why I was asking for a benchmark to figure out which results to trust. Since you are also getting 0.4283 (I get 0.4263 with a newline), I'll take that as the correct result. Thanks.
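One way to look for a leftover incompatibility like this is to print the installed package versions on both machines and compare them. The following is a minimal sketch using only the standard library; the helper `installed_versions` is hypothetical, and the package list is an assumption about which dependencies might matter, not a list taken from comet's requirements.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of packages worth comparing; adjust to the actual environment.
PACKAGES = ["unbabel-comet", "torch", "transformers", "pytorch-lightning", "numpy"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this on the Mac and on the Linux machine and comparing the two listings would show whether the environments differ.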
