comet's People

Contributors

alvations, arturnn, bramvanroy, coderpat, dependabot[bot], devrimcavusoglu, eltociear, erip, gpengzhi, hennerm, joao-maria-janeiro, kocmitom, mbtech, mjpost, new5558, phen0menon, pks, remorax, ricardorei, samuellarkin, ymoslem


comet's Issues

Comet QE asks for the reference

Using comet score with a QE model (wmt-large-qe-estimator-1719) through the command line still asks for a reference, even though the reference should not be used in the calculation (if I am not mistaken).

COMET not working with python 3.8

πŸ› Bug

COMET works with python 3.6 but not with python 3.8

Environment

OS: MacOS
Packaging: pip
Version 1.0.0rc6

To Reproduce

I pip-installed comet in my Python 3.8.12 virtual environment and then tested the "Scoring with Python" example provided in the README:

seg_scores, sys_score = model.predict(data, batch_size=8, gpus=0)

But I get the following error:

/usr/local/Cellar/[email protected]/3.8.12/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py in _launch(self, process_obj)
     45         try:
     46             reduction.dump(prep_data, fp)
---> 47             reduction.dump(process_obj, fp)
     48         finally:
     49             set_spawning_popen(None)

/usr/local/Cellar/[email protected]/3.8.12/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #

AttributeError: Can't pickle local object 'CometModel.predict.<locals>.<lambda>'
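
For context, macOS switched the default multiprocessing start method from fork to spawn in Python 3.8, and spawn pickles everything it sends to worker processes; locally defined lambdas cannot be pickled, which is exactly the failure above. A minimal, COMET-independent illustration:

import pickle

def outer():
    f = lambda x: x + 1   # locally defined lambda, like the one inside CometModel.predict
    pickle.dumps(f)       # what the spawn start method does under the hood

try:
    outer()
except Exception as e:
    print(type(e).__name__, e)   # pickling error naming the local lambda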

I tested the same code on python 3.6 and it did work, so thanks a lot :)

I expected comet to work with Python >= 3.5.
Is there any plan to make it work with Python 3.8?

Thanks again.

Protect example inside main

Dear authors,
Thanks a lot for COMET and for open-sourcing your code.

πŸ› Bug

Running word-level QE estimation triggers the following error:

Traceback (most recent call last):
File "", line 1, in
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 263, in run_path
return _run_module_code(code, init_globals, run_name,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/Ebenge/Desktop/quality_estimation/examples/word_level/wmt_2018/de_en/microtransquest.py", line 52, in
sources_tags, targets_tags = model.predict(test_sentences[:1], split_on_space=True)
File "/Users/Ebenge/Desktop/quality_estimation/transquest/algo/word_level/microtransquest/run_model.py", line 991, in predict
eval_dataset = self.load_and_cache_examples(None, to_predict=predict_examples)
File "/Users/Ebenge/Desktop/quality_estimation/transquest/algo/word_level/microtransquest/run_model.py", line 1203, in load_and_cache_examples
features = convert_examples_to_features(
File "/Users/Ebenge/Desktop/quality_estimation/transquest/algo/word_level/microtransquest/utils.py", line 345, in convert_examples_to_features
with Pool(process_count) as p:
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 212, in init
self._repopulate_pool()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
w.start()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
Traceback (most recent call last):
File "", line 1, in
return Popen(process_obj)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in init
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
super().init(process_obj)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
exitcode = _main(fd, parent_sentinel)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
_check_not_importing_main()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
prepare(preparation_data)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
_fixup_main_from_path(data['init_main_from_path'])

To Reproduce

python microtransquest.py (the one in examples/word_level/wmt_2018/de_en)

Expected behaviour

No error

Fix

Wrap everything inside a main πŸ‘
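
The idiom the RuntimeError points at, as a minimal self-contained sketch (not the actual microtransquest.py): any code that spawns worker processes must run behind a __main__ guard, otherwise each spawned child re-imports the script and recurses.

from multiprocessing import Pool

def square(x):
    return x * x

def main():
    with Pool(2) as pool:                 # safe: only runs in the parent process
        print(pool.map(square, range(4)))

if __name__ == "__main__":
    main()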

Environment

OS: macOS
certifi==2021.10.8
charset-normalizer==2.0.7
click==8.0.3
configparser==5.1.0
cycler==0.11.0
docker-pycreds==0.4.0
filelock==3.4.0
flatbuffers==2.0
fonttools==4.28.2
gitdb==4.0.9
GitPython==3.1.24
huggingface-hub==0.1.2
idna==3.3
joblib==1.1.0
kiwisolver==1.3.2
matplotlib==3.5.0
numpy==1.21.4
onnxruntime==1.9.0
packaging==21.3
pandas==1.3.4
pathtools==0.1.2
Pillow==8.4.0
promise==2.3
protobuf==3.19.1
psutil==5.8.0
pyparsing==3.0.6
python-dateutil==2.8.2
pytz==2021.3
PyYAML==6.0
regex==2021.11.10
requests==2.26.0
sacremoses==0.0.46
scikit-learn==1.0.1
scipy==1.7.2
sentencepiece==0.1.96
sentry-sdk==1.5.0
seqeval==1.2.2
setuptools-scm==6.3.2
shortuuid==1.0.8
six==1.16.0
smmap==5.0.0
subprocess32==3.5.4
tensorboardX==2.4.1
termcolor==1.1.0
threadpoolctl==3.0.0
tokenizers==0.10.3
tomli==1.2.2
torch==1.10.0
tqdm==4.62.3
transformers==4.12.5
typing-extensions==4.0.0
urllib3==1.26.7
wandb==0.12.7
yaspin==2.1.0

Additional context

The fix above solves the problem.

[QUESTION] Does COMET work on windows?

COMET installation is failing on windows. Could you please take a look?

(base) C:\>conda create --name comet_windows_3_7 python=3.7
(base) C:\>conda activate comet_windows_3_7
(comet_windows_3_7) C:\>pip install unbabel-comet

Using cached test_tube-0.7.4.tar.gz (21 kB)
    ERROR: Command errored out with exit status 1:
     command: 'C:\Users\test\Anaconda3\envs\comet_windows_3_7\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\test\\AppData\\Local\\Temp\\pip-install-68_b6icx\\test-tube_f70f8dd226d64a01b89decde5fae3cab\\setup.py'"'"'; __file__='"'"'C:\\Users\\test\\AppData\\Local\\Temp\\pip-install-68_b6icx\\test-tube_f70f8dd226d64a01b89decde5fae3cab\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\test\AppData\Local\Temp\pip-pip-egg-info-ko3bque_'
         cwd: C:\Users\test\AppData\Local\Temp\pip-install-68_b6icx\test-tube_f70f8dd226d64a01b89decde5fae3cab\
    Complete output (7 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\test\AppData\Local\Temp\pip-install-68_b6icx\test-tube_f70f8dd226d64a01b89decde5fae3cab\setup.py", line 28, in <module>
        install_requires=load_requirements(PATH_ROOT),
      File "C:\Users\test\AppData\Local\Temp\pip-install-68_b6icx\test-tube_f70f8dd226d64a01b89decde5fae3cab\setup.py", line 10, in load_requirements
        with open(os.path.join(path_dir, 'requirements.txt'), 'r') as file:
    FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\test\\AppData\\Local\\Temp\\pip-install-68_b6icx\\test-tube_f70f8dd226d64a01b89decde5fae3cab\\requirements.txt'


Add WMT test sets via sacrebleu

[Extracting from #30]

It would be nice to add support for sacrebleu-style built-in test sets, e.g.,

# one option
$ cat system.txt | comet -t wmt20 -l de-en [other args]

# another option
$ cat system.txt | comet --sacrebleu-testset wmt20/de-en
$ cat system.txt | comet --sacrebleu-testset mtedx/valid/pt-es

You could accomplish this by just using sacrebleu as a library. It’s pretty easy:

from sacrebleu.utils import get_source, get_references, get_files

# trigger sacrebleu test set
# make these optional: nargs="?" for argparse
if args.source is None and args.references is None:
    if args.sacrebleu_dataset is None:
        parser.error("either -s/-r or --sacrebleu-testset must be given")

    # some test sets are hierarchical, e.g., "mtedx/valid"
    test_set, langpair = args.sacrebleu_dataset.rsplit("/", maxsplit=1)
    source = get_source(test_set, langpair)
    ref = get_references(test_set, langpair)

    # alternative
    source, ref, _ = get_files(test_set, langpair)

Originally posted by @mjpost in #30 (comment)

Error related to incomplete model downloads in cache

When a model download is halted before it completes and a new command referring to the same model is then run (e.g. the default comet score -s src.de -h hyp.en -r ref.en), the script will try to retrieve the cached (incomplete) download and fail with an error:
Exception: [meta_tags.csv|hparams.yaml is missing from the checkpoint folder.

It is resolved if the cache is cleared.

Full error trace:


 Traceback (most recent call last):
  File "/home/chryssa/anaconda3/bin/comet", line 11, in <module>
    load_entry_point('unbabel-comet==0.0.7', 'console_scripts', 'comet')()
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/unbabel_comet-0.0.7-py3.7.egg/comet/cli.py", line 121, in score
    model = load_checkpoint(model) if os.path.exists(model) else download_model(model)
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/unbabel_comet-0.0.7-py3.7.egg/comet/models/__init__.py", line 98, in download_model
    return load_checkpoint(checkpoint_path)
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/unbabel_comet-0.0.7-py3.7.egg/comet/models/__init__.py", line 136, in load_checkpoint
    "[meta_tags.csv|hparams.yaml is missing from the checkpoint folder."

Segmentation fault error

Hello!

πŸ› Bug

Whenever I try to run your demo, when using:

from comet.models import download_model

I get:

Segmentation fault (core dumped)

To Reproduce

I'm using Amazon EC2 instances with ubuntu 18.04 and I've also tried with ubuntu 20.04 and got the same errors.
I log into the instance and do:

sudo su
apt-get update
apt-get install python3-pip
pip3 install unbabel-comet

If I'm on 18.04 I have to upgrade pip, because otherwise sentencepiece breaks.

I've tried building from source and installing from pip; nothing seems to work. It installs, and I can even do:

import comet

But whenever I run:

from comet.models import download_model

I get a segmentation fault. The same happens if I use the CLI command:

comet download -d apequest --saving_path data/

again the same error!

Environment

OS: Linux
Packaging pip
Version : pip 20.0.2

I've tried multiple instances and virtual environments, but nothing seems to be effective! With Ubuntu 20.04 I used Python 3.8; I've seen that the recommendation is Python 3.6. On Ubuntu 18.04 I used the default python3, which is 3.6.9, and still had the same issues!

Here is the output of my pip freeze


absl-py==0.11.0
cachetools==4.1.1
certifi==2020.11.8
cffi==1.14.3
chardet==3.0.4
click==7.1.2
Cython==0.29.15
fairseq==0.9.0
fastBPE==0.1.0
filelock==3.0.12
fsspec==0.8.4
future==0.18.2
google-auth==1.23.0
google-auth-oauthlib==0.4.2
grpcio==1.33.2
idna==2.10
joblib==0.17.0
Markdown==3.3.3
numpy==1.19.4
oauthlib==3.1.0
pandas==1.0.5
portalocker==2.0.0
protobuf==3.14.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
python-dateutil==2.8.1
pytorch-lightning==1.0.7
pytorch-nlp==0.5.0
pytz==2020.4
PyYAML==5.3.1
regex==2020.11.13
requests==2.25.0
requests-oauthlib==1.3.0
rsa==4.6
sacrebleu==1.4.14
sacremoses==0.0.43
scikit-learn==0.23.1
scipy==1.5.4
sentencepiece==0.1.94
six==1.15.0
sphinx-markdown-tables==0.0.15
tensorboard==2.2.0
tensorboard-plugin-wit==1.7.0
threadpoolctl==2.1.0
tokenizers==0.7.0
torch==1.4.0
tqdm==4.52.0
transformers==2.10.0
-e git+https://github.com/Unbabel/COMET@c9ac4c9cbdb8484aa5ee286c9cbe13002c16a193#egg=unbabel_comet
urllib3==1.26.2
Werkzeug==1.0.1
wget==3.2

If you could help me figure this thing out it would be wonderful!

Thank you for your time and willingness to share your tool! I'm eager to try it :-)

wrong output of comet-compare

πŸ› Bug

comet-compare outputs wrong info in the JSON file:

  • the same src, ref, x and y text is reported for every entry
  • the scores are probably correct

I think that the problems are due to this line

"src": system_x[0]["src"],

and the 3 following lines

It seems that entry 0 is always output, while it should output entry i.
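
A self-contained analogy of the suspected bug and fix (not COMET's actual code; the field names, including the "mt" key, are assumptions): each JSON entry should be indexed with i instead of always reading element 0.

system_x = [{"src": "s1", "ref": "r1", "mt": "x1"},
            {"src": "s2", "ref": "r2", "mt": "x2"}]
system_y = [{"src": "s1", "ref": "r1", "mt": "y1"},
            {"src": "s2", "ref": "r2", "mt": "y2"}]

entries = []
for i in range(len(system_x)):
    entries.append({
        "src": system_x[i]["src"],   # the buggy version reads system_x[0][...] here
        "ref": system_x[i]["ref"],
        "x": system_x[i]["mt"],
        "y": system_y[i]["mt"],
    })
print(entries)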

To Reproduce

comet-compare -s SRC.txt -r REF.txt -x SysX.txt -y SysY.txt --to_json JSON.txt

SRC.txt, REF.txt, SysX.txt and SysY.txt must have more than one line.


Environment

OS: ubuntu
Packaging: pip3
Version: unbabel-comet==1.0.0rc8

Which scale is COMET? [0,1]?

Is COMET on a 0-1 scale? Is it normalized to [0,1]? I saw in the uncertainty-aware COMET paper that there is a COMET score bigger than 1; how is that possible? Is that normal/frequent?

Get list of models

πŸš€ Feature

COMET should output a list of available models if -m is used with an invalid model name.

Motivation

It's a bit of a pain to figure out the available models from the CLI. I had to come to the Github page.


[QUESTION] Does COMET support multiple references?

Thanks for the tool! I'm wondering if COMET supports multiple references, or if we can just score each sentence with all the references and take the maximum value? Sorry if this has already been mentioned somewhere.
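
A hedged sketch of the workaround described in the question, not an official COMET feature: score the hypotheses once per reference and take the per-segment maximum. The model loading and predict() call follow the 1.x README, and the "src"/"mt"/"ref" keys are the usual input format; adapt as needed.

from comet import download_model, load_from_checkpoint   # as in the 1.x README

model = load_from_checkpoint(download_model("wmt20-comet-da"))

srcs = ["Dem Feuer konnte Einhalt geboten werden"]
hyps = ["The fire could be stopped"]
ref_sets = [  # one list of alternative references per segment
    ["They were able to control the fire.", "The fire could be contained."],
]

scores_per_ref = []
for k in range(len(ref_sets[0])):
    data = [{"src": s, "mt": h, "ref": refs[k]}
            for s, h, refs in zip(srcs, hyps, ref_sets)]
    seg_scores, _ = model.predict(data, batch_size=8, gpus=0)
    scores_per_ref.append(seg_scores)

# per-segment maximum across references
best_per_segment = [max(scores) for scores in zip(*scores_per_ref)]
print(best_per_segment)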

Why is COMET_DA default model?

Hi Unbabel,

I wanted to ask: why is wmt20-comet-da the default model when using COMET? I am a bit worried that people using it off the shelf won't understand the underlying difference and will mistakenly report COMET scores on the QE metric.

Why not set the reference-based COMET as the default, since it seems to be outperforming comet-da (also I fear that comet-da will have more biases and potential problems than reference-based).

Thank you for the answer,
Tom

TypeError: 'NoneType' object is not subscriptable when calling comet-score

πŸ› Bug

To Reproduce

Using comet 1.0.1 with Python 3.9, I get the following (using the test set in /data of mt-telescope):

$ comet-score -s newstest2020-ruen.src.ru.txt -t newstest2020-ruen.OnlineA.txt -r newstest2020-ruen.ref.en.txt
Global seed set to 12
wmt20-comet-da is already in cache.
Traceback (most recent call last):
File ".../telescope-venv/bin/comet-score", line 8, in
sys.exit(score_command())
File ".../telescope-venv/lib/python3.9/site-packages/comet/cli/score.py", line 180, in score_command
model = load_from_checkpoint(model_path)
File ".../telescope-venv/lib/python3.9/site-packages/comet/models/init.py", line 57, in load_from_checkpoint
model_class = str2model[hparams["class_identifier"]]
TypeError: 'NoneType' object is not subscriptable

Expected behaviour

Output comet score.

Environment

OS: Linux
Packaging Pip
Version 1.0.1

Error when running inference on newly trained metric

While inference (comet-score) using the provided metrics works well, if I train a new metric with comet-train and then use it to predict quality scores, I get the following error:

comet-score: error: argument --model: invalid choice: 'lightning_logs/version_19/checkpoints/epoch=1-step=129339.ckpt' (choose from 'emnlp20-comet-rank', 'wmt20-comet-da', 'wmt20-comet-qe-da', 'wmt21-cometinho-da')

The error disappears once I comment out line 64: choices=available_metrics.keys(), in comet/cli/score.py

parser.add_argument(
        "--model",
        type=Union[str, Path_fr],
        required=False,
        default="wmt20-comet-da",
        choices=available_metrics.keys(),
        help="COMET model to be used.",
    )
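
One way this could be relaxed (a sketch, not the actual COMET code): validate the argument manually instead of via choices, accepting either a registered metric name or an existing checkpoint path. The available_metrics set below just stands in for COMET's registry.

import argparse
import os

available_metrics = {"emnlp20-comet-rank", "wmt20-comet-da",
                     "wmt20-comet-qe-da", "wmt21-cometinho-da"}

def model_name_or_path(value: str) -> str:
    """Accept a registered metric name or a path to a local checkpoint."""
    if value in available_metrics or os.path.exists(value):
        return value
    raise argparse.ArgumentTypeError(
        f"{value!r} is neither an available metric nor an existing checkpoint path"
    )

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=model_name_or_path, default="wmt20-comet-da",
                    help="COMET model to be used.")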

pip install conflict

When I tried a clean install of the current COMET version via pip, I got an error that there is a conflict in versions. I managed to resolve it by manually downgrading PyYAML to 3.3.*

Could you check it, please?

GPU support information into README

COMET takes 30-40 minutes to evaluate a 400-sentence test set on CPU. A GPU is therefore necessary, but it took me some time to find out there is a "cuda" flag. Can you add an example with the cuda parameter to the README?
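
For reference, the 1.x CLI exposes a --gpus flag (it appears in the usage string quoted in a later issue on this page), so a README example along these lines would cover the GPU case; the exact flag name in older versions may differ:

comet-score -s src.de -t hyp.en -r ref.en --gpus 1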

Refless example doesn't work with 1.0.0rc4

πŸ› Bug

To Reproduce

Following the README:

> echo -e "Dem Feuer konnte Einhalt geboten werden\nSchulen und KindergΓ€rten wurden erΓΆffnet." >> src.de
> echo -e "The fire could be stopped\nSchools and kindergartens were open" >> hyp.en
> comet-score -s src.de -t hyp.en --model wmt20-comet-qe-da

results in the following output:

Global seed set to 12
usage: comet-score [-h] [-s SOURCES] [-t TRANSLATIONS] [-r REFERENCES] [--batch_size BATCH_SIZE] [--gpus GPUS]
                   [--to_json TO_JSON]
                   [--model {emnlp20-comet-rank,wmt20-comet-da,wmt20-comet-qe-da,wmt21-cometinho-da}]
                   [--mc_dropout MC_DROPOUT] [--seed_everything SEED_EVERYTHING]
comet-score: error: wmt20-comet-qe-da requires -r/--references.

Looking at the code:

COMET/comet/cli/score.py

Lines 79 to 80 in 61caa5a

if (cfg.references is None) and ("refless" not in cfg.model):
parser.error("{} requires -r/--references.".format(cfg.model))

it seems that the model has to have refless in the name, which none of the available models have:

comet-score: error: argument --model: invalid choice: [...] (choose from 'emnlp20-comet-rank', 'wmt20-comet-da', 'wmt20-comet-qe-da', 'wmt21-cometinho-da')

Expected behaviour

I'd hope to get a score for my hypotheses given the sources.

Screenshots

n/a

Environment

OS: MacOS
Packaging: pip
Version: unbabel-comet==1.0.0rc4

Additional context

Multi-GPU support?

πŸš€ Feature

Multi-GPU support would be nice.

Motivation

Scoring larger test sets takes ages on a single GPU :)

Is there a theoretical range of values for the COMET regressor?

Is there a theoretical range of values for the COMET regressor?

Since the final estimator layer is a feed-forward network (https://github.com/Unbabel/COMET/blob/master/comet/models/regression/regression_metric.py#L95), which ends in

modules.append(nn.Linear(hidden_sizes[-1], int(out_dim)))

is the theoretical range (-inf, inf) (https://pytorch.org/docs/1.9.1/generated/torch.nn.Linear.html)?

Is there a table of practical ranges that the COMET owners/contributors/users have found for varying languages and lengths?
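
For intuition, a tiny self-contained check (illustrative only, unrelated to COMET's trained weights) that a final nn.Linear layer has no squashing activation, so nothing bounds its output in principle; in practice a trained regressor typically stays near the range of its training scores:

import torch
import torch.nn as nn

head = nn.Linear(4, 1)                  # same kind of layer as the final estimator step
x = 1000.0 * torch.randn(8, 4)          # extreme inputs give extreme outputs
y = head(x)
print(y.min().item(), y.max().item())   # unbounded in principle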

Calling xlm-roberta-large via HuggingFace

πŸš€ Feature

I would like to use this library for COMET scores, but I want to host xlm-roberta-large elsewhere, e.g. on HuggingFace. How can we enable this in this library? Would modifying xlmr.py be enough to call an XLM model that is deployed to a remote GPU server?

Motivation

I would like to place the computationally intensive encoder of the model on a GPU server (for faster batch inference) that is shared by multiple COMET scorers, and maybe also with other applications that benefit from xlm-roberta-large.

Add Poetry

Use a dependency manager such as Poetry to put an end to problems with requirements.

Reimplementing results in COMET EMNLP'20 paper

Hi,

Recently I have been reimplementing the results in your COMET EMNLP'20 paper. I also carefully referred to the documentation for more details. However, when reimplementing experiments over the wmt-metrics data, I found something unexpected. Here are my preparation steps:

  1. I create a virtual environment with conda:
conda create -n comet python=3.8
conda activate comet
pip install unbabel-comet
  2. Download the wmt-metrics data via:
comet download -d wmt-metrics --saving_path data/wmt-metrics/

After these steps I continue the implementation with the released model:

  1. I want to run a test using the language pair de-en, so I first split the test19-relative-ranking.csv file into multiple files storing source, reference, positive_hypothesis, and negative_hypothesis line by line. Here is the content of the Python source file script_language_filter.py:
from argparse import ArgumentParser
from csv import reader

parser = ArgumentParser()

parser.add_argument('--input_file', type=str, required=True)
parser.add_argument('--language', type=str, required=True)
parser.add_argument('--output_src', type=str, required=True)
parser.add_argument('--output_ref', type=str, required=True)
parser.add_argument('--output_pos', type=str, required=True)
parser.add_argument('--output_neg', type=str, required=True)

args = parser.parse_args()


def main():
    with open(args.input_file, mode='r', encoding='utf-8') as f1, \
            open(args.output_src, mode='w', encoding='utf-8') as f2, \
            open(args.output_ref, mode='w', encoding='utf-8') as f3, \
            open(args.output_pos, mode='w', encoding='utf-8') as f4, \
            open(args.output_neg, mode='w', encoding='utf-8') as f5:
        csv_reader = reader(f1)
        next(csv_reader)  # skip the header line
        for _, row in enumerate(csv_reader):
            # csv_file title:
            # data, lp, src, ref, pos, neg, pos.model, neg.model, bestmodel
            # indexes of our interest:
            #       1 , 2  , 3  , 4  , 5
            if row[1] != args.language:
                continue

            f2.write(row[2].strip() + '\n')
            f3.write(row[3].strip() + '\n')
            f4.write(row[4].strip() + '\n')
            f5.write(row[5].strip() + '\n')
    return


if __name__ == '__main__':
    main()

Then, run this command:

python script_language_filter.py \
    --input_file test19-relative-ranks.csv \
    --language "de-en" \
    --output_src test19-relative-ranks.src \
    --output_ref test19-relative-ranks.ref \
    --output_pos test19-relative-ranks.pos \
    --output_neg test19-relative-ranks.neg

After this step, I get 4 more files test19-relative-ranks.{src,ref,pos,neg}, each yielding 17,073 lines.

  2. Scoring each sentence pair:
    For positive_hypothesis-reference:
comet score -s test19-relative-ranks.src \
    -h test19-relative-ranks.pos \
    -r test19-relative-ranks.ref \
    --batch_size 16 \
    --to_json test19-relative-ranks.pos.json \
    --model emnlp-base-da-ranker

For negative_hypothesis-reference:

comet score -s test19-relative-ranks.src \
    -h test19-relative-ranks.neg \
    -r test19-relative-ranks.ref \
    --batch_size 16 \
    --to_json test19-relative-ranks.neg.json \
    --model emnlp-base-da-ranker

Then I get two files storing predicted scores test19-relative-ranks.{pos,neg}.json.

  3. I didn't find a provided script for directly computing the WMT DARR Kendall score, so I simply wrote one:
from argparse import ArgumentParser
from json import load

parser = ArgumentParser()
parser.add_argument('--pos_json', type=str, required=True)
parser.add_argument('--neg_json', type=str, required=True)

args = parser.parse_args()


def main():
    with open(args.pos_json, mode='r', encoding='utf-8') as f1, \
            open(args.neg_json, mode='r', encoding='utf-8') as f2:
        pos_data = load(f1)
        neg_data = load(f2)

    concor = 0
    discor = 0

    for pos, neg in zip(pos_data, neg_data):
        if pos['predicted_score'] > neg['predicted_score']:
            concor += 1
        else:
            discor += 1

    print('%d items in total. Concor: %d, Discor: %d, WMTKendall: %f' % (concor + discor, concor, discor, (concor - discor) / (concor + discor)))
    return


if __name__ == '__main__':
    main()

And I run:

python script_compute_rr.py \
    --pos_json test19-relative-ranks.pos.json \
    --neg_json test19-relative-ranks.neg.json 

The results are:

17073 items in total. Concor: 11244, Discor: 5829, WMTKendall: 0.317167

So here I find my result is considerably higher than the reported result of 0.202 in your paper (column de-en, row COMET-RANK in Table 2). I'm not sure which step I got wrong. Besides, I also want to know whether the model tagged emnlp-base-da-ranker is exactly the trained model corresponding to the reported results in Table 2 of your paper.

Could you answer these questions for me? Many thanks!

  • OS: [Ubuntu 20.0]
  • Packaging [conda]
  • Version [0.1.0]

Model download error

❓ Questions and Help

I tried to download the model using:
model = download_model("wmt-large-da-estimator-1719")

But I get the following error:

'''
AttributeError Traceback (most recent call last)
in

----> 4 model = download_model("wmt-large-da-estimator-1719")

7 frames
/proj/tools/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in __setattr__(self, name, value)
817 buffers[name] = value
818 else:
--> 819 object.__setattr__(self, name, value)
820
821 def __delattr__(self, name):

AttributeError: can't set attribute
'''

  • OS: [Ubuntu 18.04]
  • Packaging [pip]
  • Version [20.2.4]

fastBPE installation error

πŸ› Bug

When installing either via pip install unbabel-comet or directly with pip install -r requirements.txt, I get the error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Running setup.py install for fastBPE ... error
    ERROR: Command errored out with exit status 1:
     command: /home/ubuntu/cometenv/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-eeqbscuj/fastbpe_45baab04cd16456fb32b018392790726/setup.py'"'"'; __file__='"'"'/tmp/pip-install-eeqbscuj/fastbpe_45baab04cd16456fb32b018392790726/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-hfrq684c/install-record.txt --single-version-externally-managed --compile --install-headers /home/ubuntu/cometenv/include/site/python3.7/fastBPE
         cwd: /tmp/pip-install-eeqbscuj/fastbpe_45baab04cd16456fb32b018392790726/
    Complete output (15 lines):
    running install
    running build
    running build_py
    package init file 'fastBPE/__init__.py' not found (or not a regular file)
    running build_ext
    building 'fastBPE' extension
    creating build
    creating build/temp.linux-x86_64-3.7
    creating build/temp.linux-x86_64-3.7/fastBPE
    x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -IfastBPE -I/usr/include/python3.7m -I/home/ubuntu/cometenv/include/python3.7m -c fastBPE/fastBPE.cpp -o build/temp.linux-x86_64-3.7/fastBPE/fastBPE.o -std=c++11 -Ofast -pthread
    fastBPE/fastBPE.cpp:28:10: fatal error: Python.h: No such file or directory
     #include "Python.h"
              ^~~~~~~~~~
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /home/ubuntu/cometenv/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-eeqbscuj/fastbpe_45baab04cd16456fb32b018392790726/setup.py'"'"'; __file__='"'"'/tmp/pip-install-eeqbscuj/fastbpe_45baab04cd16456fb32b018392790726/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-hfrq684c/install-record.txt --single-version-externally-managed --compile --install-headers /home/ubuntu/cometenv/include/site/python3.7/fastBPE Check the logs for full command output.

Environment

(Ubuntu 18.04) Version 34.0
Python 3.7

Warnings like "Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaModel"

❓ Questions and Help

What is your question?

When I run "comet-score -s test.en-zh.en -t decoder-out -r test.en-zh.zh", I got the following warnings. Is that normal? or am I missing something?

/root/.cache/torch/unbabel_comet/wmt20-comet-da//checkpoints/model.ckpt
Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaModel: ['lm_head.bias', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight']
This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Encoder model frozen.
/usr/local/python3/lib/python3.8/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.

warnings.warn("Setting attributes on ParameterList is not supported.")
GPU available: True, used: True

What's your environment?

  • Linux
  • python 3.8
  • Version

Shortened version of comet-score output

πŸš€ Feature

Comet-score should be able to print a shortened score, as an average of all segment scores, when passed a particular flag (maybe something like --quiet). To the best of my knowledge, this does not seem possible currently (though of course, I could be wrong as I am new to this package)

Motivation

Currently, comet-score prints a line-by-line score for each segment. This can be overkill, especially if one is only interested in the score for the whole test set (which is currently calculated as the average of the segment scores). Displaying only the average would be useful in these cases.

Additional context

This is the current output when I run comet-score from the CLI:

[Screenshot: comet-score CLI output, 2022-02-06]

certificate verify failed: unable to get local issuer certificate

πŸ› Bug

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)>

error when running comet-score on Python > 3.7 on macOS.

To Reproduce

Install python via homebrew, create a virtual environment with comet and then run comet-score.

Environment

OS: macOS Mojave 10.14.6

Additional context

It seems that, for some reason, Brew has not run the Install Certificates.command that comes in the Python3 bundle for Mac.

Cannot disable the progress bar

πŸ› Bug

Hi,

I'm using version 1.0.0rc6 and scoring from within Python: seg_scores, sys_score = model.predict(data, gpus=1).

However, I cannot disable the progress bar using show_progress=False. It would be nice to have this option.

Thanks!

ImportError: cannot import name 'container_abcs' from 'torch._six'

πŸ› Bug

If I have apex installed, this library throws ImportError: cannot import name 'container_abcs' from 'torch._six'.

To reproduce

When I try to run this example from the readme:

echo -e "Dem Feuer konnte Einhalt geboten werden\nSchulen und KindergΓ€rten wurden erΓΆffnet." >> src.de
echo -e "The fire could be stopped\nSchools and kindergartens were open" >> hyp1.en
echo -e "The fire could have been stopped\nSchools and pre-school were open" >> hyp2.en
echo -e "They were able to control the fire.\nSchools and kindergartens opened" >> ref.en
comet-score -s src.de -t hyp1.en -r ref.en

...I get: ImportError: cannot import name 'container_abcs' from 'torch._six', but if I uninstall apex, comet works again.

Torch versions:

  • torch==1.10.0
  • torchmetrics==0.6.0
  • torchtext==0.5.0
  • apex @ git+git://github.com/NVIDIA/apex.git@700d6825e205732c1d6be511306ca4e595297070

Traceback

Traceback (most recent call last):
  File "/home/scarrion/anaconda3/envs/mltests/bin/comet-score", line 5, in <module>
    from comet.cli.score import score_command
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/comet/__init__.py", line 19, in <module>
    from .download_utils import download_model
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/comet/download_utils.py", line 26, in <module>
    from comet.models import available_metrics
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/comet/models/__init__.py", line 17, in <module>
    from .regression.regression_metric import RegressionMetric
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/comet/models/regression/regression_metric.py", line 26, in <module>
    from comet.models.base import CometModel
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/comet/models/base.py", line 29, in <module>
    import pytorch_lightning as ptl
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning import metrics  # noqa: E402
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
    from pytorch_lightning.metrics.classification import (  # noqa: F401
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.classification.accuracy import Accuracy  # noqa: F401
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 18, in <module>
    from pytorch_lightning.metrics.utils import deprecated_metrics
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/metrics/utils.py", line 29, in <module>
    from pytorch_lightning.utilities import rank_zero_deprecation
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
    from pytorch_lightning.utilities.apply_func import move_data_to_device  # noqa: F401
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 26, in <module>
    from pytorch_lightning.utilities.imports import _compare_version, _TORCHTEXT_AVAILABLE
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/utilities/imports.py", line 73, in <module>
    _APEX_AVAILABLE = _module_available("apex.amp")
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/utilities/imports.py", line 36, in _module_available
    return find_spec(module_path) is not None
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/importlib/util.py", line 94, in find_spec
    parent = __import__(parent_name, fromlist=['__path__'])
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/apex/__init__.py", line 8, in <module>
    from . import amp
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/apex/amp/__init__.py", line 1, in <module>
    from .amp import init, half_function, float_function, promote_function,\
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/apex/amp/amp.py", line 1, in <module>
    from . import compat, rnn_compat, utils, wrap
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/apex/amp/rnn_compat.py", line 1, in <module>
    from . import utils, wrap
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/apex/amp/wrap.py", line 3, in <module>
    from ._amp_state import _amp_state
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/apex/amp/_amp_state.py", line 14, in <module>
    from torch._six import container_abcs
ImportError: cannot import name 'container_abcs' from 'torch._six' (/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/torch/_six.py)

Environment

OS: Ubuntu 20.04
Packaging: pip
Version: 1.0.1

Requirements conflicts

Hi,
I am using your tool in my Python pipelines, but I ran into a problem: your requirements are too strict and I get conflicts with many tools. Could you please reinvestigate whether you must pin all packages to exact versions?
Additionally, are you planning to shift to transformers 3?
Thank you,
TK

Add a check if language is supported by model

Hello,
In my ongoing evaluation of metrics, I have found that COMET (especially the source-based variant) is very unpredictable when it evaluates a language that is not supported by the XLM model. This can easily happen because there is no list of supported languages for COMET on your Git page, and users would need to dig into the XLM paper/repository themselves.

Would it be possible to add a check that the language is supported (and thus ask for the language code when evaluating)?

Here I share my findings with you. Below is a graph for source-based COMET (QE); the Y-axis shows human deltas and the X-axis shows COMET deltas. The green dots around COMET delta 0 (it isn't exactly 0) are for language pairs where one of the languages is not supported by XLM: you can see that humans did find a difference for those languages, but COMET was chaotic (other metrics don't have this problem).

[Image: scatter plot of human deltas (Y-axis) vs. COMET deltas (X-axis)]

[QUESTION] Train my own metrics without source sentences

Hi Ricardo, thank you so much for your previous answers.
I have a follow-up question regarding training COMET myself without using any source sentences.

So the input to the system will be translated sentences, reference sentences and the rating of the translations.
Is there a way to do this in the current code base? Thank you in advance.

[QUESTION] How to calculate corpus level COMET?

I understand COMET returns a single score for each sentence it evaluates. I was wondering if there is any way to report a corpus-level metric, and what that would be, similar to how BLEU is reported.
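
For what it's worth, the system-level number reported by comet-score (and the sys_score returned by model.predict in the Python API) is currently just the average of the segment scores, as noted in the "Shortened version of comet-score output" issue above, so a corpus-level COMET can be computed as that mean. A minimal sketch, assuming data and model are set up as in the README:

seg_scores, sys_score = model.predict(data, batch_size=8, gpus=0)
corpus_score = sum(seg_scores) / len(seg_scores)   # matches the reported sys_score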

Model outputs error right after finishing training

πŸ› Bug

Hello! I've tried to train a COMET model using my own data. I want to train using HTER as a metric, and I used the configuration that's present in the repo: https://github.com/Unbabel/COMET/blob/master/configs/xlmr/base/hter-estimator.yaml

To Reproduce

Python 3.6.9

python3 -m venv comet
pip install unbabel-comet
comet train -f config.yml 

Where config.yml is the configuration I mentioned above with alterations to the training data path.
It does not seem to be an issue with the data as I have the correct column names and the model did train through the 2 epochs that were established in the configuration file.

Expected behaviour

Trained model, that could be loaded via python.

Screenshots

Here's the output from my logs.

Epoch 2: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 25000/25000 [1:16:17<00:00,  5.46it/s, loss=0.056, v_num=4-54, pearson=0.924, kendall=0.81, spearman=0.946, avg_loss=0.0621] 
Traceback (most recent call last):                            
  File "/home/ubuntu/comet/bin/comet", line 33, in <module>
    sys.exit(load_entry_point('unbabel-comet==0.0.6', 'console_scripts', 'comet')())
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/comet/cli.py", line 63, in train
    trainer.fit(model)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 453, in fit
    self.call_hook('on_fit_end')
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 835, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/trainer/callback_hook.py", line 57, in on_fit_end
    callback.on_fit_end(self, self.get_model())
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py", line 35, in wrapped_fn
    return fn(*args, **kwargs)
TypeError: on_fit_end() takes 2 positional arguments but 3 were given

Environment

OS: Linux
Packaging: pip
Version: latest

Thank you for your time!

Cumprimentos,

Jose :-)

[QUESTION] Is comet download still a supported command?

❓ Questions and Help

What is your question?

Is comet download still a supported command? If not, what is the best way to download the data needed to reproduce the results?

Code

$ comet download --help
=>  command not found: comet
$ comet-download --help
=>  command not found: comet-download

What have you tried?

I tried running comet download as specified in data/README.md, and I get the error command not found: comet. However, comet-score and comet-compare work, so I know that I did install it. I also tried comet-download and still get the same error.

What's your environment?

  • OS: macOS Big Sur 11.5.2
  • Packaging: conda
  • Version 4.8.3
  • Python 3.9.7

How can I use QE model with HTER? (wmt20-comet-qe-hter)

I'm using wmt20-comet-qe-da for MT quality estimation. I wanted to use an HTER-based QE model for better interpretability. Is that model supported yet? If not, can you guide me a bit in understanding DA system scores? E.g., I have a source and an MT output and get a DA score - what is the threshold for it to be considered good or bad?

Read from STDIN

πŸš€ Feature

It would be really nice if COMET could read input from STDIN, e.g.,

# three fields triggers comet-ref
$ paste source.txt hyps.txt ref.txt | comet [args]

# two fields -> comet-src
$ paste source.txt hyps.txt | comet [args]

Motivation

This is consistent with standard UNIX usage. It is also slightly less cumbersome, and allows comet to be used in settings without writing files to disk.
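
A minimal sketch of how the proposed stdin mode could be parsed (illustrative, not part of COMET): split each line on tabs and let the field count pick reference-based vs. source-only scoring; the "src"/"mt"/"ref" keys follow the usual COMET input format.

import sys

data = []
for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:                   # src, hyp, ref -> reference-based scoring
        src, mt, ref = fields
        data.append({"src": src, "mt": mt, "ref": ref})
    elif len(fields) == 2:                 # src, hyp -> source-only (QE) scoring
        src, mt = fields
        data.append({"src": src, "mt": mt})
    else:
        raise ValueError(f"expected 2 or 3 tab-separated fields, got {len(fields)}")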

[QUESTION] About HTER models in download list.

❓ Questions and Help


What is your question?

Hi, I found that the HTER models are missing from the download list in the current code.
https://github.com/Unbabel/COMET/blob/master/comet/models/__init__.py
I wonder whether they are still supported in the current version.

I used version 1.0.0rc9, and it reports this:
"Exception: wmt-large-hter-estimator is not in the availale_metrics or is a valid checkpoint folder."
Is that normal or should I use the previous version?
Thanks.


What's your environment?

  • OS: Linux
  • Packaging pip
  • Version 1.0.0rc9

proposed python version (3.6) throws error when installing requirements (3.7 works)

Running pip install -r requirements.txt with the proposed Python 3.6 results in RuntimeError: Python version >= 3.7 required when trying to install the fairseq module. Installation runs smoothly with 3.7.

Full error trace below:

Collecting fairseq==0.9.0 (from -r requirements.txt (line 7))
Cache entry deserialization failed, entry ignored
Cache entry deserialization failed, entry ignored
Downloading https://files.pythonhosted.org/packages/67/bf/de299e082e7af010d35162cb9a185dc6c17db71624590f2f379aeb2519ff/fairseq-0.9.0.tar.gz (306kB)
  100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 307kB 2.0MB/s 
  Complete output from command python setup.py egg_info:
  Traceback (most recent call last):
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 154, in save_modules
      yield saved
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 195, in setup_context
      yield
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 250, in run_setup
      _execfile(setup_script, ns)
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 45, in _execfile
      exec(code, globals, locals)
    File "/tmp/easy_install-s58jkscl/numpy-1.20.1/setup.py", line 30, in <module>
      self.__include_dirs = []
  RuntimeError: Python version >= 3.7 required.
  
  During handling of the above exception, another exception occurred:
  
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-build-5bklbiad/fairseq/setup.py", line 161, in <module>
      zip_safe=False,
    File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 128, in setup
      _install_setup_requires(attrs)
    File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 123, in _install_setup_requires
      dist.fetch_build_eggs(dist.setup_requires)
    File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 513, in fetch_build_eggs
      replace_conflicting=True,
    File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 774, in resolve
      replace_conflicting=replace_conflicting
    File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1057, in best_match
      return self.obtain(req, installer)
    File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1069, in obtain
      return installer(requirement)
    File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 580, in fetch_build_egg
      return cmd.easy_install(req)
    File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 698, in easy_install
      return self.install_item(spec, dist.location, tmpdir, deps)
    File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 724, in install_item
      dists = self.install_eggs(spec, download, tmpdir)
    File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 909, in install_eggs
      return self.build_and_install(setup_script, setup_base)
    File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 1177, in build_and_install
      self.run_setup(setup_script, setup_base, args)
    File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 1163, in run_setup
      run_setup(setup_script, args)
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 253, in run_setup
      raise
    File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
      self.gen.throw(type, value, traceback)
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 195, in setup_context
      yield
    File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
      self.gen.throw(type, value, traceback)
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 166, in save_modules
      saved_exc.resume()
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 141, in resume
      six.reraise(type, exc, self._tb)
    File "/usr/lib/python3/dist-packages/setuptools/_vendor/six.py", line 685, in reraise
      raise value.with_traceback(tb)
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 154, in save_modules
      yield saved
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 195, in setup_context
      yield
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 250, in run_setup
      _execfile(setup_script, ns)
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 45, in _execfile
      exec(code, globals, locals)
    File "/tmp/easy_install-s58jkscl/numpy-1.20.1/setup.py", line 30, in <module>
      self.__include_dirs = []
  RuntimeError: Python version >= 3.7 required.

[QUESTION] Training my own metric using Comet regression model

Hi Ricardo, I read all your code implementations. From my understanding, when I use comet-train to train my own evaluation metric, I will not load in any pretrained Comet weights except the initialization of XLM-Roberta weights. I just want to double check if this is correct. Currently, I set "resume_from_checkpoint" in the trainer.yaml in the config to be "null".

Thank you in advance! This is a great work.

Progress bar should go to stderr not stdout

πŸ› Bug

I'm running

for i in *.output.txt; do
  comet-score -s src.txt -r newstest2021.en-de.ref.ref-A.de -t "$i" > "$i.comet"
done

and the progress bar is going into $i.comet.

To Reproduce

comet-score -s src.txt -r ref.txt -t hyp.txt >output
less output

Expected behaviour

It would be better to:

  1. Send progress to stderr instead of stdout.
  2. Sense if the progress bar is going to a terminal and suppress if not.
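
A minimal sketch of both points with a tqdm-style bar (illustrative only; COMET's "Predicting:" bar appears to come from PyTorch Lightning's Trainer, which would need its own hook):

import sys
from tqdm import tqdm

# write progress to stderr, and suppress it entirely when stderr is not a terminal
for _ in tqdm(range(126), file=sys.stderr, disable=not sys.stderr.isatty()):
    pass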

Screenshots

less output shows this:

Predicting: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 126/126 [00:45<00:00, 2.79it/s]

Environment

OS: Ubuntu 20.04 x86_64
Packaging: pip
Version: unbabel-comet==1.0.0rc2

Additional context
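For reference, a minimal sketch of one common workaround for spawn-related multiprocessing failures: protecting the prediction call behind an if __name__ == "__main__": guard so that worker processes do not re-execute it when they re-import the script. This assumes the error reported above is the usual spawn/re-import failure; the model name and sentences are illustrative only, not the exact setup used here.

# Hedged sketch: wrap the prediction call in a __main__ guard so that
# worker processes started with the "spawn" method do not re-execute it
# when they re-import this script. Model id and data are illustrative.
from comet import download_model, load_from_checkpoint

def main():
    model_path = download_model("wmt20-comet-da")  # illustrative model id
    model = load_from_checkpoint(model_path)
    data = [{"src": "Hallo Welt.", "mt": "Hello world.", "ref": "Hello, world."}]
    seg_scores, sys_score = model.predict(data, batch_size=8, gpus=0)
    print(seg_scores, sys_score)

if __name__ == "__main__":
    main()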

TypeError: 'type' object is not subscriptable

πŸ› Bug

After installation, whether via pip, poetry, or direct usage (./comet/cli/score.py), I get the following error:

$ PYTHONPATH=. python3 ./comet/cli/score.py
Traceback (most recent call last):
  File "./comet/cli/score.py", line 56, in <module>
    from comet.download_utils import download_model
  File "/home/mattpost/src/COMET/comet/__init__.py", line 19, in <module>
    from .download_utils import download_model
  File "/home/mattpost/src/COMET/comet/download_utils.py", line 26, in <module>
    from comet.models import available_metrics
  File "/home/mattpost/src/COMET/comet/models/__init__.py", line 17, in <module>
    from .regression.regression_metric import RegressionMetric
  File "/home/mattpost/src/COMET/comet/models/regression/regression_metric.py", line 26, in <module>
    from comet.models.base import CometModel
  File "/home/mattpost/src/COMET/comet/models/base.py", line 41, in <module>
    class OrderedSampler(Sampler[int]):
TypeError: 'type' object is not subscriptable

This is with Python 3.8.10, so relatively recent.

Environment

OS: Linux (ubuntu 20.04.3)
Packaging: all
Version: 1.0.1 (latest source)
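For context, the subscripted base class in the last traceback frame is what triggers the error on older torch releases, where torch.utils.data.Sampler is not yet a typing generic. Below is a hedged sketch of the failing construct and one possible local workaround; the OrderedSampler body is illustrative, not COMET's actual implementation, and upgrading torch is the cleaner fix.

# Hedged sketch: torch.utils.data.Sampler is only subscriptable (generic)
# in newer torch releases; on older ones, Sampler[int] raises
# "TypeError: 'type' object is not subscriptable". Falling back to the
# plain class sidesteps the error.
from torch.utils.data import Sampler

try:
    _SamplerBase = Sampler[int]
except TypeError:
    _SamplerBase = Sampler

class OrderedSampler(_SamplerBase):
    """Illustrative sampler that yields indices in a fixed order."""

    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)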

[QUESTION] Can Different COMET Metrics Give Opposing Results for Same MT System

Hello,

We are validating our system against both the "wmt-large-da-estimator-1719" and "wmt-large-hter-estimator" estimators, using the same translation dataset (70k+ translations) in both cases.

The two estimators give completely opposite results: the "da" estimator places our MT system in "...the bottom 25%", while the "HTER" estimator returns a "top 25%" score.

I know this is not a technical issue, but could you please provide some additional information on how to interpret these results?

Thank you very much
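Not an official answer, but one likely explanation, assuming the two models follow the usual conventions: DA-based estimators predict a quality score where higher is better, while HTER-based estimators predict an edit-rate-like quantity where lower is better, so the two scales have opposite polarity and should not be read the same way. A minimal sketch of putting both on a "higher is better" footing before ranking; the clipping to [0, 1] is an illustrative assumption.

# Hedged sketch: align polarities before comparing systems. Assumes
# da_score is "higher is better" and hter_score approximates an edit
# rate in [0, 1], where lower is better.
def higher_is_better(da_score: float, hter_score: float):
    hter = min(max(hter_score, 0.0), 1.0)  # clip, since regressors can overshoot
    return da_score, 1.0 - hter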

small score difference for identical outputs

πŸ› Bug

Using comet-compare, I noticed that the exact same output from two different systems receives slightly different scores. The difference is almost negligible, but it is still not counted as a tie, so it affects the reported number of wins/losses.

    "ties (%)": 0.0,
    "x_wins (%)": 1.0,
    "y_wins (%)": 0.0

{
    "src": "NedΓ‘vno prohrΓ‘l s Raonicem v Brisbane Open.",
    "system_x": {
        "mt": "He recently lost to Raonic at the Brisbane Open.",
        "score": 0.8726277947425842
    },
    "system_y": {
        "mt": "He recently lost to Raonic at the Brisbane Open.",
        "score": 0.872564971446991
    },
    "ref": "He recently lost against Raonic in the Brisbane Open."
},

To Reproduce

comet-compare -s SRC -r REF -x SysX -y SysY --to_json JJJ

Actually, I am calling the compare_command() function from cli/compare.py directly.

Expected behaviour

Either an identical score for identical outputs, or a more tolerant counting of wins/losses/ties, as sketched below.
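A minimal sketch of the second option, counting a pair as a tie whenever the two segment scores differ by less than a tolerance; eps and the output layout are illustrative choices, not comet-compare's actual behaviour.

# Hedged sketch: tolerance-based win/loss/tie counting over parallel
# lists of segment scores. eps is an arbitrary illustrative threshold.
def compare_with_ties(x_scores, y_scores, eps=1e-3):
    wins_x = wins_y = ties = 0
    for x, y in zip(x_scores, y_scores):
        if abs(x - y) <= eps:
            ties += 1
        elif x > y:
            wins_x += 1
        else:
            wins_y += 1
    total = len(x_scores)
    return {
        "x_wins (%)": wins_x / total,
        "y_wins (%)": wins_y / total,
        "ties (%)": ties / total,
    }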


Environment

OS: ubuntu
Packaging: pip3
Version: unbabel-comet==1.0.0rc8

Problems with refless model

πŸ› Bug

there are several issues with refless models:

Actually, I solved all the bugs in my local code.


Environment

OS: ubuntu
Packaging: pip3
Version: unbabel-comet==1.0.0rc8

Benchmark tests?

I've installed COMET on two different machines (Python 3.8.10 on Ubuntu, running on a GPU, and Python 3.9.10 on macOS, running on CPU) and am getting vastly different results on each machine for the same model and the same texts. Clearly, they can't both be correct.

I'm wondering whether there are any benchmark results I can compare against (for example, for the simple examples in the installation instructions) so I can figure out what's going on and validate the installation. I'm also wondering whether the fact that I'm running COMET on Chinese characters might have something to do with the different results. Are there any benchmark results for EN<>CN?

Thanks.
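In the absence of published benchmark numbers, a minimal sanity check is to pin the package and model versions on both machines, score one fixed triplet, and compare the segment scores directly. The model id and the zh-en sentences below are illustrative; small CPU/GPU numerical differences are expected, while large gaps usually point to version or preprocessing mismatches.

# Hedged sketch: record the environment, score a fixed sample with a
# pinned model, and compare this script's output across machines.
from importlib.metadata import version

import torch
from comet import download_model, load_from_checkpoint

def main():
    # Note the exact versions on each machine before comparing scores.
    print("unbabel-comet", version("unbabel-comet"), "| torch", torch.__version__)
    model_path = download_model("wmt20-comet-da")  # illustrative model id
    model = load_from_checkpoint(model_path)
    data = [{
        "src": "δ»–ζ˜¨ε€©εŽ»δΊ†εŒ—δΊ¬γ€‚",                      # toy zh source
        "mt": "He went to Beijing yesterday.",
        "ref": "Yesterday he traveled to Beijing.",
    }]
    seg_scores, sys_score = model.predict(data, batch_size=1, gpus=0)
    print(seg_scores, sys_score)

if __name__ == "__main__":
    main()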
