
bleurt's Introduction

Google Research

This repository contains code released by Google Research.

All datasets in this repository are released under the CC BY 4.0 International license, which can be found here: https://creativecommons.org/licenses/by/4.0/legalcode. All source files in this repository are released under the Apache 2.0 license, the text of which can be found in the LICENSE file.


Because the repo is large, we recommend you download only the subdirectory of interest:

SUBDIR=foo
svn export https://github.com/google-research/google-research/trunk/$SUBDIR

If you'd like to submit a pull request, you'll need to clone the repository; we recommend making a shallow clone (without history).

git clone git@github.com:google-research/google-research.git --depth=1

Disclaimer: This is not an official Google product.

Updated in 2023.

bleurt's People

Contributors

tsellam

bleurt's Issues

issue

I followed the commands up to this point:

wget https://storage.googleapis.com/bleurt-oss/bleurt-base-128.zip .
unzip bleurt-base-128.zip
python -m bleurt.score \
  -candidate_file=bleurt/test_data/candidates \
  -reference_file=bleurt/test_data/references \
  -bleurt_checkpoint=bleurt-base-128

I completed that step in your program successfully, but when I run the Python API with code like this:

from bleurt import score

checkpoint = "C:\bleurt-master\bert-base-128"
references = ["This is a test."]
candidates = ["This is the test."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=references, candidates=candidates)
assert type(scores) == list and len(scores) == 1
print(scores)

It fails with: AssertionError: Could not find BLEURT checkpoint Cleurt-masteert-base-128
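A likely cause (my reading, not confirmed by the maintainers): in a regular Python string, \b is an escape character, so "C:\bleurt-master\bert-base-128" is mangled into the garbled path shown in the error before BLEURT ever sees it. A minimal sketch of the fix:

import os

# Use a raw string (or forward slashes) so the backslashes survive intact.
checkpoint = r"C:\bleurt-master\bert-base-128"
# Alternatively: checkpoint = "C:/bleurt-master/bert-base-128"
assert os.path.isdir(checkpoint), f"No such directory: {checkpoint}"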

How to use the checkpoints of BERT "warmed up" with synthetic ratings?

After downloading BLEURT and successfully using the test_checkpoint and the fine-tuned checkpoints, I wanted to try the "warmed up" version. However, if I directly download a "warmed up" checkpoint and use it, it shows an error:

OSError: SavedModel file does not exist at: bleurt/bert-large-midtrained/bert-large//{saved_model.pbtxt|saved_model.pb}

After looking into the details, I found that the file types under the "warmed up" checkpoint differ from those under the test_checkpoint and the fine-tuned checkpoints.
Under a fine-tuned checkpoint:

bert_config.json  bleurt_config.json  saved_model.pb  variables  vocab.txt

Under a warmed-up checkpoint:

bert_config.json  bert-large.data-00000-of-00001  bert-large.index  bert-large.meta  bleurt_config.json  vocab.txt

So how can we directly use the warmed up checkpoint for evaluation?
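For what it's worth, the file listing above suggests the warmed-up checkpoints are raw TensorFlow checkpoints (.data/.index/.meta files) rather than exported SavedModels, so they presumably serve as initialization for fine-tuning rather than for direct scoring. A hedged sketch, reusing the fine-tuning flags that appear elsewhere on this page (-init_checkpoint is my assumption; check checkpoints.md for the exact flag names):

python -m bleurt.finetune \
  -init_checkpoint=bert-large-midtrained/bert-large \
  -train_set=ratings_train.jsonl \
  -dev_set=ratings_dev.jsonl \
  -num_train_steps=20000 \
  -model_dir=my_bleurt_checkpoint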

Install fails (ERROR: Could not install packages due to an EnvironmentError)

I am trying to install this on my Linux machine, but it fails (error shown below). The install works fine on my Mac, however. I can't find anything useful online. Any suggestions for getting this running?

ERROR: Could not install packages due to an EnvironmentError: [('/mnt/batch/tasks/shared/LS_root/mounts/clusters/path_to_bleurt/bleurt/.git/branches',.... (very long error)]

Is text truncation to 512 tokens handled automatically for both candidate and reference texts?

Hi,

I wanted to clarify the following information. On the checkpoints page here, you mention that

Currently, the following six BLEURT checkpoints are available, fine-tuned on WMT Metrics ratings data from 2015 to 2018. They vary on two aspects: the size of the model, and the size of the input.

Let's say I am using the following model - BLEURT-Base, 512 (max #tokens). In my case, both generated text and reference text are longer than 512 tokens. While computing the BLEURT, will it automatically truncate both generated text and reference text to fit the requirement and then calculate the score between truncated versions of generated text and reference text? Or do I need to cut the length of generated text and reference text manually before calling the function to calculate BLEURT?

Many thanks in advance,
Ruslan
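In case automatic truncation is a concern, here is a rough manual pre-truncation sketch (my own workaround, not part of BLEURT; whitespace tokens only approximate WordPiece counts, so leave generous headroom):

def rough_truncate(text, max_words=200):
    # Crude whitespace-level cut; BLEURT's real limit counts WordPiece tokens,
    # and both segments share the max_seq_length budget.
    return " ".join(text.split()[:max_words])

candidates = [rough_truncate(c) for c in candidates]  # candidates: generated texts
references = [rough_truncate(r) for r in references]  # references: reference texts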

db_builder.py add support for WMT20?

Hi. Thanks for your amazing efforts. The WMT data downloader is a really helpful tool.
I wonder whether you are considering adding support for WMT20? That would be great.

Thanks!

file input from process substitution doesn't work

this works:

 python -m bleurt.score  -candidate_file bleurt/test_data/candidates \
   -reference_file bleurt/test_data/references \
   -bleurt_checkpoint=bleurt/test_checkpoint

But this does not work (on my machine):

python -m bleurt.score  -candidate_file <(head bleurt/test_data/candidates) \
   -reference_file <(head bleurt/test_data/references) \
   -bleurt_checkpoint=bleurt/test_checkpoint

Why?

NOTE:
<(head /path/to/file) is meant to demonstrate a simple use case of process substitution.

In reality, someone like me would use
<(detokenize.sh < /path/to/file)
or <(cut -f2 /path/to/file.tsv)
and those should work with tf.io.gfile.GFile, right?
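A workaround sketch (mine, not from the BLEURT docs): do the preprocessing in Python and pass plain lists to the Python API, which sidesteps whatever tf.io.gfile does with the /dev/fd/* paths produced by process substitution. The preprocessed file names below are hypothetical:

from bleurt import score

with open("candidates.detok") as f:   # hypothetical preprocessed file
    candidates = [line.strip() for line in f]
with open("references.detok") as f:   # hypothetical preprocessed file
    references = [line.strip() for line in f]

scorer = score.BleurtScorer("bleurt/test_checkpoint")
print(scorer.score(references=references, candidates=candidates))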

Interpreting Bleurt scores

Hi,
sorry for the stupid question; maybe you have already answered this.
You wrote: "In practice however, the answers tend to be very correlated with fluency ("Is the text fluent English?"), and we added synthetic noise in the training set which makes the distinction between adequacy and fluency somewhat fuzzy."
I'm a little bit confused by this.
Which aspects of the text does the metric ultimately evaluate in the translation task?
Thanks, and sorry for my bad English ;)

How to use multiple reference with BLEURT?

It looks like BLEURT calculates the score for one hypothesis and one reference. What if I have multiple references, say 5 or 6? Should I calculate the score against each of them and average, for every single sentence?
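One common convention (an assumption on my part, not an official BLEURT recommendation) is to score the candidate against every reference and keep the maximum, or the mean, per sentence. A minimal sketch:

from bleurt import score

scorer = score.BleurtScorer("bleurt/test_checkpoint")

def multi_ref_score(candidate, refs, reduce=max):
    # Score the same candidate against each reference, then reduce.
    scores = scorer.score(references=refs, candidates=[candidate] * len(refs))
    return reduce(scores)

print(multi_ref_score("This is the test.", ["This is a test.", "That was a test."]))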

What tensorflow versions are possible?

I am trying to run bleurt with tensorflow 2.15 and I get

TypeError: Binding inputs to tf.function failed due to `got an unexpected keyword argument 'input_ids'`. Received args: () and kwargs: {'input_ids': <tf.Tensor: shape=(10, 128), dtype=int64, numpy=

It looks like my TensorFlow version is too new.

What is the newest tensorflow version supported by bleurt?

Results mismatch using released BLEURT-Large-128

Hi there,

Recently I have been interested in reimplementing and investigating your research. However, when I directly use your released code and the BLEURT-Large-128 checkpoint, I can't get results comparable to what you present here. Here is what I got:

  • de-en: 29.15
  • fi-en: 30.64
  • gu-en: 27.49
  • kk-en: 39.14
  • lt-en: 34.28
  • ru-en: 26.75
  • zh-en: 41.98
  • avg: 32.78

I first followed your command to get the WMT2019 data and the BLEURT-Large-128 checkpoint. After evaluating the whole dataset file, I collected the prediction scores, split the results and the corresponding gold scores by language pair, and computed Kendall results using scipy.stats.kendalltau, as in your implementation.
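For concreteness, a sketch of that evaluation loop (my reconstruction, not the authors' script):

from scipy.stats import kendalltau

def kendall_by_pair(rows):
    # rows: iterable of (lang_pair, bleurt_score, human_score) triples.
    by_pair = {}
    for pair, pred, gold in rows:
        preds, golds = by_pair.setdefault(pair, ([], []))
        preds.append(pred)
        golds.append(gold)
    return {pair: kendalltau(preds, golds)[0]
            for pair, (preds, golds) in by_pair.items()}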

So I'm wondering whether I've missed any detail. Could you help me? Thanks!

Python API test_checkpoint not found

Hi,

Thank you for releasing the code. Interesting work!

I am trying to use the Python API in my code as:

from bleurt import score
checkpoint = "bleurt/test_checkpoint"
references = ["This is a test."]
candidates = ["This is the test."]
scorer = score.BleurtScorer(checkpoint)

however, the bleurt/test_checkpoint is not found.

>>> scorer = score.BleurtScorer(checkpoint)
INFO:tensorflow:Reading checkpoint bleurt/test_checkpoint.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/metrics/lib/python3.7/site-packages/bleurt/score.py", line 133, in __init__
    config = checkpoint_lib.read_bleurt_config(checkpoint)
  File "/opt/conda/envs/metrics/lib/python3.7/site-packages/bleurt/checkpoint.py", line 78, in read_bleurt_config
    "Could not find BLEURT checkpoint {}".format(path)
AssertionError: Could not find BLEURT checkpoint bleurt/test_checkpoint

Is there a missing download link here?

If I don't provide any checkpoint

scorer = score.BleurtScorer()

Expected behavior, per the docs: "If bleurt_checkpoint is not specified, BLEURT will default to the test checkpoint, based on BERT-Tiny." However, I am getting an assertion error:

>>> scorer = score.BleurtScorer()
INFO:tensorflow:No checkpoint specified, defaulting to BLEURT-tiny.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/metrics/lib/python3.7/site-packages/bleurt/score.py", line 130, in __init__
    checkpoint = _get_default_checkpoint()
  File "/opt/conda/envs/metrics/lib/python3.7/site-packages/bleurt/score.py", line 56, in _get_default_checkpoint
    "Default checkpoint not found! Are you sure the install is complete?"
AssertionError: Default checkpoint not found! Are you sure the install is complete?

Could you please suggest a workaround? Thanks!
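One quick sanity check that often explains this (my suggestion, not from the docs): the relative path bleurt/test_checkpoint only resolves when running from the root of the cloned repository; with a pip-installed package, point at an absolute path to a downloaded checkpoint instead.

import os

checkpoint = os.path.abspath("bleurt/test_checkpoint")
assert os.path.isdir(checkpoint), f"No checkpoint directory at {checkpoint}"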

Getting BLEURT to Work in Jupyter Notebook on Windows: Unknown command line flag 'f', int32 error

An unknown flag error is raised. I installed exactly as described on the front page.

  • tensorflow 2.3 or 2.0
  • python 3.7.5
  • Windows

Arises from either:

references = ['This is a test.', 'This is surely a test']
candidates = ['This is also a text', 'This could be a test']
checkpoint = 'C:/bleurt/bleurt/checkpoints/bleurt-tiny-512'
scorer = score.BleurtScorer(checkpoint)
scorer.score(references, candidates, batch_size = 2)

or

references = tf.constant(["This is a test."])
candidates = tf.constant(["This is the test."])
checkpoint = 'C:/bleurt/bleurt/checkpoints/bleurt-tiny-512'
scorer = score.BleurtScorer(checkpoint)
scorer.score(references, candidates, batch_size = 2)

Error:

UnrecognizedFlagError                     Traceback (most recent call last)
<ipython-input> in <module>
     13
     14 scorer = score.BleurtScorer(checkpoint)
---> 15 scorer.score(references, candidates, batch_size = 2)
     16 # bleurt_out = scorer(references, candidates)
     17 # # bleurt_ops = score.create_bleurt_ops()

c:\programdata\anaconda3\envs\context2\lib\site-packages\bleurt\score.py in score(self, references, candidates, batch_size)
178 batch_cand = candidates[i:i + batch_size]
179 input_ids, input_mask, segment_ids = encoding.encode_batch(
--> 180 batch_ref, batch_cand, self.tokenizer, self.max_seq_length)
181 tf_input = {
182 "input_ids": input_ids,

c:\programdata\anaconda3\envs\context2\lib\site-packages\bleurt\encoding.py in encode_batch(references, candidates, tokenizer, max_seq_length)
150 encoded_examples = []
151 for ref, cand in zip(references, candidates):
--> 152 triplet = encode_example(ref, cand, tokenizer, max_seq_length)
153 example = np.stack(triplet)
154 encoded_examples.append(example)

c:\programdata\anaconda3\envs\context2\lib\site-packages\bleurt\encoding.py in encode_example(reference, candidate, tokenizer, max_seq_length)
56 # Tokenizes, truncates and concatenates the sentences, as in:
57 # bert/run_classifier.py
---> 58 tokens_ref = tokenizer.tokenize(reference)
59 tokens_cand = tokenizer.tokenize(candidate)
60

c:\programdata\anaconda3\envs\context2\lib\site-packages\bleurt\lib\tokenization.py in tokenize(self, text)
144 def tokenize(self, text):
145 split_tokens = []
--> 146 for token in self.basic_tokenizer.tokenize(text):
147 if preserve_token(token, self.vocab):
148 split_tokens.append(token)

c:\programdata\anaconda3\envs\context2\lib\site-packages\bleurt\lib\tokenization.py in tokenize(self, text)
189 split_tokens = []
190 for token in orig_tokens:
--> 191 if preserve_token(token, self.vocab):
192 split_tokens.append(token)
193 continue

c:\programdata\anaconda3\envs\context2\lib\site-packages\bleurt\lib\tokenization.py in preserve_token(token, vocab)
43 def preserve_token(token, vocab):
44 """Returns True if the token should forgo tokenization and be preserved."""
---> 45 if not FLAGS.preserve_unused_tokens:
46 return False
47 if token not in vocab:

c:\programdata\anaconda3\envs\context2\lib\site-packages\tensorflow\python\platform\flags.py in __getattr__(self, name)
     83     # a flag.
     84     if not wrapped.is_parsed():
---> 85       wrapped(_sys.argv)
     86     return wrapped.__getattr__(name)
     87

c:\programdata\anaconda3\envs\context2\lib\site-packages\absl\flags\_flagvalues.py in __call__(self, argv, known_only)
    631       suggestions = _helpers.get_flag_suggestions(name, list(self))
    632       raise _exceptions.UnrecognizedFlagError(
--> 633           name, value, suggestions=suggestions)
    634
    635     self.mark_as_parsed()

UnrecognizedFlagError: Unknown command line flag 'f'
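This is the well-known clash between absl's lazy flag parsing and the -f <connection file> argument that Jupyter passes to its kernel. A common community workaround (not an official fix) is to trim sys.argv before calling the scorer:

import sys

# In a notebook, sys.argv contains Jupyter's "-f <kernel connection file>",
# which absl tries to parse as a flag the first time FLAGS is accessed.
# Keeping only the program name avoids the UnrecognizedFlagError.
sys.argv = sys.argv[:1]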

How to calculate overall bleurt score?

Hi,

When I run the code on my own dataset, I get back the scores file with a score calculated for each sample. Am I supposed to take the average of all the generated scores?

My dataset consists of the reference file with a single line per sample, and a generated summary file also with a single line per sample. I am doing data to text generation, and the reference file is all the expected output summaries, and the generated summary file is all the actual output summaries.
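Taking the mean of the per-sentence scores is the usual convention for a single corpus-level number (a convention, not something the tool enforces). Assuming the file written by -scores_file contains one score per line:

import numpy as np

with open("scores") as f:
    scores = [float(line) for line in f]
print("corpus-level BLEURT:", np.mean(scores))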

WMT17 dataset: Mismatch in candidate-reference pairs counts

Hi, I was trying to download the WMT17 dataset using the wmt/db_builder example shared in Experiments with the WMT Metrics shared task section. However, I found that downloading the WMT17 dataset in this manner results only in 3920 candidate-reference pairs, while the number of candidate-reference pairs in WMT17 is mentioned to be 5344 in the research paper.

Thus, I wished to check what might be causing this discrepancy in the count of candidate-reference pairs.

PS: I also tried setting the average_duplicates flag to False when calling wmt/db_builder, but that resulted in 4132 samples, still lower than 5344.

Installation check error: Expected to be a int64 tensor but is a int32.

Hi,

I have installed BLEURT and am running the test script to verify the installation, but I get the error below. The directory paths seem to be correct.

python -m bleurt.score \
  -candidate_file=bleurt/test_data/candidates \
  -reference_file=bleurt/test_data/references \
  -bleurt_checkpoint=bleurt/test_checkpoint \
  -scores_file=scores


INFO:tensorflow:BLEURT initialized.
I0630 08:28:46.424506 24396 score.py:151] BLEURT initialized.
INFO:tensorflow:Computing BLEURT scores...
I0630 08:28:46.424506 24396 score.py:305] Computing BLEURT scores...
Traceback (most recent call last):
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\bleurt\score.py", line 344, in
tf.app.run()
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\bleurt\score.py", line 339, in main
FLAGS.bleurt_checkpoint)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\bleurt\score.py", line 321, in score_files
_consume_buffer()
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\bleurt\score.py", line 300, in _consume_buffer
scores = scorer.score(ref_buffer, cand_buffer, FLAGS.bleurt_batch_size)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\bleurt\score.py", line 186, in score
predict_out = self.predict_fn(tf_input)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\bleurt\score.py", line 70, in _predict_fn
segment_ids=tf.constant(input_dict["segment_ids"])
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\tensorflow_core\python\eager\function.py", line 1551, in call
return self._call_impl(args, kwargs)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\tensorflow_core\python\eager\function.py", line 1591, in _call_impl
return self._call_flat(args, self.captured_inputs, cancellation_manager)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
ctx=ctx)
File "C:\Users\amit.prakash\Anaconda3\envs\tf\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: cannot compute __inference_pruned_1485 as input #0(zero-based) was expected to be a int64 tensor but is a int32 tensor [Op:__inference_pruned_1485]
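A plausible cause, for what it's worth (my guess, not confirmed by the maintainers): on Windows, older NumPy versions default to int32 integers, while the exported BLEURT graph expects int64 inputs. The platform difference is easy to see:

import numpy as np

# Historically int32 on Windows (NumPy < 2.0), int64 on most Linux/macOS builds.
print(np.array([1, 2, 3]).dtype)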

Thanks,
Amit

BLEURT for Spanish

I want to use BLEURT for image captioning in Spanish, but when I searched for a parameter to enable that, I didn't find one.
So I would like to know: does BLEURT detect the language of the captions automatically, or is there some parameter I need to change? How does BLEURT handle this, and how can I use it for Spanish?

About the range of BLEURT scores.

Hi,
I'm trying to evaluate my models with BLEURT, and I find contradictory descriptions between the current README and previous issues about how to interpret BLEURT scores.

As mentioned in README, "The currently recommended checkpoint BLEURT-20 generates scores which are roughly between 0 and 1 (sometimes less than 0, sometimes more than 1), where 0 indicates a random output and 1 a perfect one."

And as mentioned in issue #1, the statistics of the training corpus show that a large portion of samples have negative values.

If I understand correctly, I can bound the BLEURT scores manually, treating 0 as random and 1 as perfect: for instance, negative values can be set to 0, and values greater than 1 can be truncated to 1.
Am I right?
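For concreteness, a minimal sketch of the clipping scheme proposed above (the proposal is the asker's own; whether clipping is appropriate depends on how the scores are consumed downstream):

def clip_score(s, lo=0.0, hi=1.0):
    # Clamp "worse than random" below lo and "better than perfect" above hi.
    return min(max(s, lo), hi)

print([clip_score(s) for s in [-0.3, 0.42, 1.07]])  # [0.0, 0.42, 1.0]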

WMT metric shared dataset download error

We cannot access the current download URLs for the wmt17 and wmt18 datasets.
When I run this command,

python -m bleurt.wmt.db_builder   -target_language="en"   -rating_years="2017"

It gives an error

INFO:tensorflow:Downloading newstest2017-segment-level-human from http://computing.dcu.ie/~ygraham/newstest2017-segment-level-human.tar.gz
I0824 12:51:08.780502 140389933356864 downloaders.py:139] Downloading newstest2017-segment-level-human from http://computing.dcu.ie/~ygraham/newstest2017-segment-level-human.tar.gz
Downloading data from http://computing.dcu.ie/~ygraham/newstest2017-segment-level-human.tar.gz
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/keras/utils/data_utils.py", line 274, in get_file
    urlretrieve(origin, fpath, dl_progress)
  File "/opt/conda/lib/python3.8/site-packages/keras/utils/data_utils.py", line 82, in urlretrieve
    response = urlopen(url, data)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/opt/conda/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/private/metric_shared/bleurt/bleurt/wmt/db_builder.py", line 273, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/data/private/metric_shared/bleurt/bleurt/wmt/db_builder.py", line 262, in main
    create_wmt_dataset(FLAGS.target_file, FLAGS.rating_years,
  File "/data/private/metric_shared/bleurt/bleurt/wmt/db_builder.py", line 100, in create_wmt_dataset
    importer.fetch_files()
  File "/data/private/metric_shared/bleurt/bleurt/wmt/downloaders.py", line 305, in fetch_files
    super(Importer17, self).fetch_files()
  File "/data/private/metric_shared/bleurt/bleurt/wmt/downloaders.py", line 140, in fetch_files
    _ = tf.keras.utils.get_file(
  File "/opt/conda/lib/python3.8/site-packages/keras/utils/data_utils.py", line 276, in get_file
    raise Exception(error_msg.format(origin, e.code, e.msg))
Exception: URL fetch failure on http://computing.dcu.ie/~ygraham/newstest2017-segment-level-human.tar.gz: 503 -- Service Unavailable

We have to change the URL to the HTTPS version, for example,
https://www.computing.dcu.ie/~ygraham/newstest2017-segment-level-human.tar.gz

UnrecognizedFlagError: Unknown command line flag 'f'

Hello!

I'm trying to run the code below, following the instructions in the README, and I'm getting an error. Can you help me? The code used and the output follow. The TensorFlow version used is 2.2.0.

import os
!git clone https://github.com/google-research/bleurt.git
os.chdir('bleurt')
!pip install .
from bleurt import score
import tensorflow as tf

checkpoint = "bleurt/test_checkpoint"
references = ["This is a test."]
candidates = ["This is the test."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references, candidates)
assert type(scores) == list and len(scores) == 1
print(scores)


UnrecognizedFlagError                     Traceback (most recent call last)
<ipython-input> in <module>()
      9
     10 scorer = score.BleurtScorer(checkpoint)
---> 11 scores = scorer.score(references, candidates)
     12 assert type(scores) == list and len(scores) == 1
     13 print(scores)

2 frames
/usr/local/lib/python3.6/dist-packages/absl/flags/_flagvalues.py in __call__(self, argv, known_only)
    631       suggestions = _helpers.get_flag_suggestions(name, list(self))
    632       raise _exceptions.UnrecognizedFlagError(
--> 633           name, value, suggestions=suggestions)
    634
    635     self.mark_as_parsed()

UnrecognizedFlagError: Unknown command line flag 'f'

Can't use predict_fn with TF2

Hi !

When I try to instantiate a PythonPredictor with a predict_fn function, I get this error:

    def __init__(self, predict_fn):
>     tf.logging.info("Creating Python-based predictor.")
E     AttributeError: module 'tensorflow' has no attribute 'logging'

I'm using TF2 so tensorflow.logging isn't available.

cc @tsellam I think this comes from the recent change to 0.0.2
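tf.logging was removed in TF2, but the TF1 endpoint still exists under tf.compat.v1. A stopgap sketch (my suggestion, not necessarily the project's actual fix) is to restore the alias before the predictor is created:

import tensorflow as tf

# Restore the TF1-style alias that the predictor code expects.
if not hasattr(tf, "logging"):
    tf.logging = tf.compat.v1.logging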

How to load rembert distilled models?

Hi, I am trying to load the RemBERT distilled models for some of my downstream tasks. However, I am not able to do so.

AutoTokenizer.from_pretrained(model, **kwargs)

Can you help?

BLEURT consumes all available memory on checkpoint load?

Not quite sure what's happening here - running CUDA 11.6 and TensorFlow 2.10.0. No matter what checkpoint I use, all available GPU memory is consumed. Minimum reproducible example here:

import os
from bleurt import score

bleurtcheckpoints = os.path.join(os.getcwd(), "bleurtcktpts")
checkpoint = os.path.join(bleurtcheckpoints, "bleurt-tiny-128/")
scorer = score.BleurtScorer(checkpoint)
INFO:tensorflow:Reading checkpoint /data/visualization/vis-text/datasets/vis-text/bleurtcktpts/bleurt-tiny-128/.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:128
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.

2022-10-25 21:53:54.567812: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.690050: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.691074: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.692770: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-25 21:53:54.693762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.694475: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:54.695136: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:56.420924: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:56.422112: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:56.423287: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-25 21:53:56.424176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 11413 MB memory:  -> device: 0, name: NVIDIA TITAN Xp, pci bus id: 0000:00:05.0, compute capability: 6.1

INFO:tensorflow:BLEURT initialized.

nvidia-smi results after (right before, it's only showing 4 MiB in use):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     On   | 00000000:00:05.0 Off |                  N/A |
| 23%   30C    P2    58W / 250W |  11697MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     16853      C   /usr/bin/python3.8              11693MiB |
+-----------------------------------------------------------------------------+
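Reserving nearly all GPU memory up front is TensorFlow's default allocator behavior, not something BLEURT-specific. Enabling memory growth (a stock TF option) makes TF allocate on demand; a sketch, to be run before loading the checkpoint:

import tensorflow as tf

# Ask TF to grow GPU allocations as needed instead of reserving everything.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)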

Does bleurt support Chinese?

I tried to use your fine-tuned model on Chinese, but the results are awful, with a 0.5 Pearson correlation with sacrebleu. Is it because your model does not support Chinese? If not, how can I use your code on Chinese?

About fine-tuning

Why is it that when I fine-tuned the model and saved it, no weights were preserved, only some config files?

UnrecognizedFlagError: Unknown command line flag 'f'

Getting this issue.

    10 bleurt_ops = score.create_bleurt_ops()
---> 11 bleurt_out = bleurt_ops(references, candidates)
     12 
     13 assert bleurt_out["predictions"].shape == (1,)

11 frames
/usr/local/lib/python3.6/dist-packages/absl/flags/_flagvalues.py in __call__(self, argv, known_only)
    631       suggestions = _helpers.get_flag_suggestions(name, list(self))
    632       raise _exceptions.UnrecognizedFlagError(
--> 633           name, value, suggestions=suggestions)
    634 
    635     self.mark_as_parsed()

UnrecognizedFlagError: Unknown command line flag 'f'

Issue with loading fine-tuned BLEURT

I am unable to load the fine-tuned model. It is falling back and loading the base model.

How I am loading the model -
model = evaluate.load("bleurt", module_type="metric", checkpoint="/finetuned_bleurt/export/bleurt_best/1689684085/")

I also tried the following, but same issue -
bleurt_model = evaluate.load("bleurt", module_type="metric", checkpoint="/finetuned_bleurt_base_128/")

What am I doing wrong here?

Optimising BLEURT for large dataset

Hi, great paper!
I am trying to use this implementation to compute BLEURT scores for > 60K English sentence pairs and even with the provided GPU optimisations, it takes days to calculate the scores.
Is there any way to configure this to run in a shorter time?
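Two knobs worth trying first (generic advice on my part: batch_size is a documented parameter of BleurtScorer.score, as seen in other issues on this page, and the smaller 128-token checkpoints trade some accuracy for speed):

from bleurt import score

refs = ["This is a test."] * 4            # placeholder data
cands = ["This is the test."] * 4

# Assumes the bleurt-base-128 checkpoint has been downloaded and unzipped.
scorer = score.BleurtScorer("bleurt-base-128")
scores = scorer.score(references=refs, candidates=cands, batch_size=64)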

Checkpoints fine-tuned on WebNLG

Could you release the checkpoints fine-tuned on WebNLG, as mentioned in Section 5.3 of the paper? If they are not allowed to be public, could you give more specific instructions on how to fine-tune on WebNLG ratings? Thanks!

Python API Error (UnrecognizedFlagError: Unknown command line flag 'f')

Hello,

I have installed the BLEURT according to steps mentioned in README. I also have run all installation tests, and everything works fine. Then, I am trying to use the Python API, following the mentioned script:

from bleurt import score

checkpoint = "bleurt/test_checkpoint"
references = ["This is a test."]
candidates = ["This is the test."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references, candidates)
assert type(scores) == list and len(scores) == 1
print(scores)

I have checked that my Tensorflow is >=1.15, and tf-slim is >=1.1. I am using Python3.6 in Google Colab, in Tesla K80 GPU. However I got this error:

UnrecognizedFlagError                     Traceback (most recent call last)

<ipython-input-2-8427261da7b2> in <module>()
      6 
      7 scorer = score.BleurtScorer(checkpoint)
----> 8 scores = scorer.score(references, candidates)
      9 assert type(scores) == list and len(scores) == 1
     10 print(scores)

2 frames

/usr/local/lib/python3.6/dist-packages/bleurt/score.py in score(self, references, candidates, batch_size)
    164     """
    165     if not batch_size:
--> 166       batch_size = FLAGS.bleurt_batch_size
    167 
    168     candidates, references = list(candidates), list(references)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/flags.py in __getattr__(self, name)
     83     # a flag.
     84     if not wrapped.is_parsed():
---> 85       wrapped(_sys.argv)
     86     return wrapped.__getattr__(name)
     87 

/usr/local/lib/python3.6/dist-packages/absl/flags/_flagvalues.py in __call__(self, argv, known_only)
    631       suggestions = _helpers.get_flag_suggestions(name, list(self))
    632       raise _exceptions.UnrecognizedFlagError(
--> 633           name, value, suggestions=suggestions)
    634 
    635     self.mark_as_parsed()

UnrecognizedFlagError: Unknown command line flag 'f'

Could you suggest a workaround? Thank you!

JIT compilation failed

I was trying to run the sample code below (Python API) in a Jupyter notebook, but encountered this error.

Sample Code:

import bleurt
from bleurt import score

checkpoint = "bleurt/test_checkpoint"
references = ["This is a test."]
candidates = ["This is the test."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=references, candidates=candidates)
assert isinstance(scores, list) and len(scores) == 1
print(scores)

Error:

2024-03-30 23:05:27.516741: W tensorflow/core/framework/op_kernel.cc:1827] UNKNOWN: JIT compilation failed.
2024-03-30 23:05:27.516767: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: UNKNOWN: JIT compilation failed.
	 [[{{node bert/embeddings/LayerNorm/batchnorm/Rsqrt}}]]

JIT compilation failed. 
[[{{node bert/embeddings/LayerNorm/batchnorm/Rsqrt}}]] [Op:__inference_pruned_2804]. 

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/anaconda3/envs/nlu/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/function_type_utils.py:442, in bind_function_inputs(args, kwargs, function_type, default_values)
    441 try:
--> 442   bound_arguments = function_type.bind_with_defaults(
    443       args, sanitized_kwargs, default_values
    444   )
    445 except Exception as e:

File ~/anaconda3/envs/nlu/lib/python3.9/site-packages/tensorflow/core/function/polymorphism/function_type.py:264, in FunctionType.bind_with_defaults(self, args, kwargs, default_values)
    263 """Returns BoundArguments with default values filled in."""
--> 264 bound_arguments = self.bind(*args, **kwargs)
    265 bound_arguments.apply_defaults()

File ~/anaconda3/envs/nlu/lib/python3.9/inspect.py:3045, in Signature.bind(self, *args, **kwargs)
   3041 """Get a BoundArguments object, that maps the passed `args`
   3042 and `kwargs` to the function's signature.  Raises `TypeError`
   3043 if the passed arguments can not be bound.
   3044 """
-> 3045 return self._bind(args, kwargs)

File ~/anaconda3/envs/nlu/lib/python3.9/inspect.py:3034, in Signature._bind(self, args, kwargs, partial)
   3033     else:
-> 3034         raise TypeError(
   3035             'got an unexpected keyword argument {arg!r}'.format(
   3036                 arg=next(iter(kwargs))))
   3038 return self._bound_arguments_cls(self, arguments)

TypeError: got an unexpected keyword argument 'input_ids'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
File ~/anaconda3/envs/nlu/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/concrete_function.py:1179, in ConcreteFunction._call_impl(self, args, kwargs)
   1178 try:
-> 1179   return self._call_with_structured_signature(args, kwargs)
   1180 except TypeError as structured_err:

File ~/anaconda3/envs/nlu/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/concrete_function.py:1259, in ConcreteFunction._call_with_structured_signature(self, args, kwargs)
   1245 """Executes the wrapped function with the structured signature.
   1246 
   1247 Args:
   (...)
   1256     of this `ConcreteFunction`.
   1257 """
   1258 bound_args = (
-> 1259     function_type_utils.canonicalize_function_inputs(
   1260         args, kwargs, self.function_type)
   1261 )
   1262 filtered_flat_args = self.function_type.unpack_inputs(bound_args)

File ~/anaconda3/envs/nlu/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/function_type_utils.py:422, in canonicalize_function_inputs(args, kwargs, function_type, default_values, is_pure)
    421   args, kwargs = _convert_variables_to_tensors(args, kwargs)
--> 422 bound_arguments = bind_function_inputs(
    423     args, kwargs, function_type, default_values
    424 )
    425 return bound_arguments

File ~/anaconda3/envs/nlu/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/function_type_utils.py:446, in bind_function_inputs(args, kwargs, function_type, default_values)
    445 except Exception as e:
--> 446   raise TypeError(
    447       f"Binding inputs to tf.function failed due to `{e}`. "
    448       f"Received args: {args} and kwargs: {sanitized_kwargs} for signature:"
    449       f" {function_type}."
    450   ) from e
    451 return bound_arguments

TypeError: Binding inputs to tf.function failed due to `got an unexpected keyword argument 'input_ids'`. Received args: () and kwargs: {'input_ids': <tf.Tensor: shape=(1, 512), dtype=int64, numpy=

.......

UnknownError: Graph execution error:
Detected at node bert/embeddings/LayerNorm/batchnorm/Rsqrt defined at (most recent call last)

I followed the README instructions to install, and it worked on the command line. Any idea about this error?

Error in finetuning BLEURT

Thank you for the great work and for open-sourcing it!

I am trying to follow the instructions in https://github.com/google-research/bleurt/blob/master/checkpoints.md#from-an-existing-bleurt-checkpoint to fine-tune the BLEURT-20 model on a customized set of ratings.

However, when I run the suggested command,

python -m bleurt.finetune \
  -train_set=../data/ratings_train.jsonl \
  -dev_set=../data/ratings_dev.jsonl \
  -num_train_steps=500 \
  -model_dir=../models/bleurt-20-fine1 \
  -init_bleurt_checkpoint=../models/BLEURT-20/

I get the following issue:

ValueError: Shape of variable bert/embeddings/LayerNorm/beta:0 ((1152,)) doesn't match with shape of tensor bert/embeddings/LayerNorm/beta ([256]) from checkpoint reader.

I have checked this with both TensorFlow 2.7 and 1.15.

Any help related to this would be appreciated!

Answer comparison inaccuracy

I'm wondering if I'm doing something wrong. I set the candidate and reference with the command:

python -m bleurt.score   -candidate_file=bleurt/test_data/candidates   -reference_file=bleurt/test_data/references   -bleurt_checkpoint=bleurt/test_checkpoint   -scores_file=scores

I'm comparing the following candidate and reference:

"A group of tasks that you monitor as a single unit."
"Aggregates a set of tasks and synchronize behaviors on the group."

The result yields: -0.4743865132331848

My experience with NLP is still limited, so I apologize if my expectations have been set too high.

Install with pip fails

When I try to install the repo as a package with pip, there is no problem on Linux, but on Windows with Python 3.9 it fails due to a failure reading README.md. Basically, on Windows Python tries to open the file with the cp1254 encoding by default, which results in a failed installation with the following error message:

(base) C:\Users\devri>pip install git+https://github.com/google-research/bleurt.git --force-reinstall --no-cache-dir     
Collecting git+https://github.com/google-research/bleurt.git
  Cloning https://github.com/google-research/bleurt.git to c:\users\devri\appdata\local\temp\pip-req-build-l293_p23                    
  Running command git clone -q https://github.com/google-research/bleurt.git 'C:\Users\devri\AppData\Local\Temp\pip-req-build-l293_p23'
  Resolved https://github.com/google-research/bleurt.git to commit c6f2375c7c178e1480840cf27cb9e2af851394f9
    ERROR: Command errored out with exit status 1:
     command: 'C:\tools\Anaconda3\envs\jury\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\devri\\AppData\\Local\\Temp\\pip-req-build-l293_p23\\setup.py'"'"'; __file__='"'"'C:\\Users\\devri\\AppData\\Local\\Temp\\pip-req-build-l293_p23\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.
exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\devri\AppData\Local\Temp\pip-pip-egg-info-s1jaf9gz'
         cwd: C:\Users\devri\AppData\Local\Temp\pip-req-build-l293_p23\
    Complete output (7 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\devri\AppData\Local\Temp\pip-req-build-l293_p23\setup.py", line 23, in <module>
        long_description = fh.read()
      File "C:\tools\Anaconda3\envs\base\lib\encodings\cp1254.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8e in position 2560: character maps to <undefined>
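The traceback shows setup.py reading README.md with the platform-default codec (cp1254 under this Windows locale). The standard Python remedy, sketched here (whether upstream adopted exactly this, I can't confirm), is to pin the encoding in setup.py:

# In setup.py: read the long description with an explicit encoding so the
# platform default (cp1254 here) is never used.
with open("README.md", encoding="utf-8") as fh:
    long_description = fh.read()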

Incompatible dependencies with installing through pip

When installing this repo through pip, it raises the following errors.

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

tensorflow 2.3.1 requires numpy<1.19.0,>=1.16.0, but you'll have numpy 1.19.2 which is incompatible.

My workaround is to install an older numpy version beforehand by running pip install numpy==1.18.5.

However, when I then run the test with python -m unittest bleurt.score_test, it crashes with "Illegal instruction (core dumped)" in bash.

BLEURT returning scores less than zero

I'm not sure if this is supposed to happen or not, but when testing BLEURT on the test_data ( "Bud Powell was a legendary pianist.", etc...) I'm getting scores that are way below zero:

  1. Here is my output for bleurt/test_checkpoint:
     0.9129246473312378
     0.2755325436592102
     -0.34470897912979126
     -0.737292468547821
  2. And my output for bleurt-base-128:
     1.003721833229065
     0.5313903093338013
     -1.489485502243042
     -1.6975871324539185

Though I may be wrong, my understanding is that scores should be between 0 and 1.
Thanks!

Code for Section 4 of the paper

Hi,

I was wondering if you released the code for either generating synthetic sentence pairs or running pre-training tasks that you mentioned in the Section 4 of the BLEURT paper. Thank you!

Reproducing table 2 results

Hello and thank you very much for your contribution to the field and open-sourcing the code.

I am trying to reproduce the Table 2 results from the paper using the code specified here. I had to add a value for -max_seq_length, since the command wouldn't run otherwise. I also train for 40k steps instead of the 20k specified in the paper. Otherwise, I am running the exact same command.

The results I obtain are different from the ones shown in Table 2 of the paper. Here's what I obtain:

{"cs-en": {"kendall": 0.45062611806797853, "pearson": 0.6431185796263721, "spearman": 0.6229406024891663, "wmt_da_rr_kendall": -1.0, "sys-kendall": 1.0, "sys-pearson": 0.9755128178401778, "sys-spearman": 1.0}, "de-en": {"kendall": 0.4543906669799155, "pearson": 0.6222954800372593, "spearman": 0.6351588328764061, "wmt_da_rr_kendall": null, "sys-kendall": 0.5636363636363636, "sys-pearson": 0.8306197237216054, "sys-spearman": 0.8000000000000002}, "fi-en": {"kendall": 0.5518527983644262, "pearson": 0.7438350752272211, "spearman": 0.7497050145476959, "wmt_da_rr_kendall": null, "sys-kendall": 0.9999999999999999, "sys-pearson": 0.9914245385710291, "sys-spearman": 1.0}, "lv-en": {"kendall": 0.5359953999488883, "pearson": 0.7415974274659147, "spearman": 0.7285339831167466, "wmt_da_rr_kendall": null, "sys-kendall": 0.9444444444444445, "sys-pearson": 0.9648751710079252, "sys-spearman": 0.9833333333333333}, "ru-en": {"kendall": 0.5051750575006388, "pearson": 0.7192891660812555, "spearman": 0.6896893120559332, "wmt_da_rr_kendall": null, "sys-kendall": 0.8333333333333334, "sys-pearson": 0.9496574815839979, "sys-spearman": 0.9166666666666666}, "tr-en": {"kendall": 0.48177868642984917, "pearson": 0.6692174622720356, "spearman": 0.6613286166637742, "wmt_da_rr_kendall": null, "sys-kendall": 0.7333333333333333, "sys-pearson": 0.8750333198839647, "sys-spearman": 0.8545454545454544}, "zh-en": {"kendall": 0.47126245847176074, "pearson": 0.6771181689880665, "spearman": 0.6452188030847402, "wmt_da_rr_kendall": null, "sys-kendall": 0.7, "sys-pearson": 0.8358359469478294, "sys-spearman": 0.8617647058823529}}

Do you have any guidance as to what I might be doing wrong? Could it be that I'm not using the correct initial BLEURT checkpoint?

Thanks a lot
