embeddings-benchmark / mteb
MTEB: Massive Text Embedding Benchmark
Home Page: https://arxiv.org/abs/2210.07316
License: Apache License 2.0
Hello,
Thanks for this extensive work.
I have a question about the sequence lengths of the various models used in this benchmark. Different models support different sequence lengths: text-embedding-ada handles up to 8191 tokens, while instructor-xl was trained with a maximum length of only 512 tokens. Is this taken into account during evaluation?
Please forgive me if I'm being ignorant.
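For sentence-transformers models, the limit in question is exposed as the max_seq_length attribute; a minimal sketch (the model name is just an example, not one of the models discussed above):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model
print(model.max_seq_length)  # inputs longer than this are truncated by the model
model.max_seq_length = 128   # can be lowered if you want to enforce a shorter limit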
Hi,
thank you for providing the benchmark and easy-to-use codebase!
When evaluating the sickr-sts task, I get a KeyError: 'validation'. The reason is that 'validation' is included in the "eval_splits" of the SickrSTS task description, while mteb/sts15-sts only provides a test set. Should 'validation' be removed from the task description?
The BEIR tasks are currently all marked as S2S, but some of them are P2P or S2P / P2S. Retrieval is the only task where we have S2P / P2S. Does that make sense?
Options I see:
Any thoughts? cc @NouamaneTazi
I'd propose to add the commit hash of the revision to tasks:
from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer


class MindSmallReranking(AbsTaskReranking):
    @property
    def description(self):
        return {
            "name": "MindSmallReranking",
            "hf_hub_name": "mteb/mind_small",
            "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",
            "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",
            "type": "Reranking",
            "category": "s2s",
            "eval_splits": ["validation"],
            "eval_langs": ["en"],
            "main_score": "map",
            "revision": "75937953179...",
        }


model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation.run(model)
This is then fed into load_dataset via revision= and added to the results json file.
This partly addresses #21
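A minimal sketch of what the load step could look like with a pinned revision. It reuses the MindSmallReranking example above; note the revision hash there is truncated, so this is illustrative only:

import datasets

task = MindSmallReranking()
desc = task.description
# Pin the Hub dataset to the commit recorded in the task description so results
# stay reproducible even if the dataset is updated later.
dataset = datasets.load_dataset(desc["hf_hub_name"], revision=desc["revision"])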
^ It should be explained in the logs why it seems to be repeating the same thing
Task: AmazonReviewsClassification, split: test, language: en. Running...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:03<00:00, 1.60s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...
Batches: 100%|██████████| 157/157 [03:04<00:00, 1.18s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Fitting logistic regression classifier...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Evaluating...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:02<00:00, 1.39s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...
Batches: 100%|██████████| 157/157 [03:04<00:00, 1.18s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Fitting logistic regression classifier...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Evaluating...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:03<00:00, 1.68s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...
Batches: 100%|██████████| 157/157 [03:04<00:00, 1.18s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Fitting logistic regression classifier...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Evaluating...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:04<00:00, 2.48s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...
Batches: 100%|██████████| 157/157 [03:04<00:00, 1.18s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Fitting logistic regression classifier...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Evaluating...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:03<00:00, 1.71s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...
MindSmallReranking has no test split. For all other datasets we use the test split afaik, so should we use the validation split for that one?
Hello, this work is wonderful. However, I have one question: how do you ensure that the comparisons are fair? Data leakage may occur if some models use the train/test data for pretraining or fine-tuning, particularly for newly submitted models.
As far as I can tell, the sentence-transformers dependency is not necessary for this code to run; it is only used as a shorthand for model loading on the command line. Because installing sentence-transformers also installs torch, sentencepiece, tokenizers and transformers itself, this is quite a big dependency to package. Maybe the installation of sentence-transformers could be split off into an optional dependency?
I.e., pip install mteb[sentencetransformers] would install mteb packaged with sentence-transformers. When running functionality that requires sentence-transformers, the user could be prompted to install it.
Hi
I'm trying to run the below code on Colab
from mteb import MTEB
from sentence_transformers import SentenceTransformer
from mteb.tasks import QuoraRetrieval
model_name = "average_word_embeddings_komninos"
model = SentenceTransformer(model_name)
evaluation = MTEB(tasks=["QuoraRetrieval"]) # Only select clustering and retrieval tasks
results = evaluation.run(model, output_folder=f"results/{model_name}")
I get the following error
Retrieval
- QuoraRetrieval, beir, s2s
ERROR:mteb.evaluation.MTEB:Error while evaluating QuoraRetrieval: File /root/.cache/huggingface/datasets/BeIR/quora/qrels/validation.tsv not present! Please provide accurate file.
ERROR:mteb.evaluation.MTEB:Please check all the error logs at: error_logs.txt
When I check the qrels folder, I only find dev and test tsvs. This issue occurs for other tasks as well, such as MSMARCO.
Any idea what I'm doing wrong?
Packages:
!pip install -q git+https://github.com/UKPLab/sentence-transformers.git
!pip install -q git+https://github.com/embeddings-benchmark/mteb.git
!pip install -q git+https://github.com/NouamaneTazi/beir.git@fix_drpes_ids
!pip install -q evaluate
Doing
import time
from mteb import MTEB
from sentence_transformers import SentenceTransformer
class SentenceTransformerX(SentenceTransformer):
    pass
model_name = "sentence-transformers/average_word_embeddings_komninos"
model = SentenceTransformerX(model_name)
evaluation = MTEB(tasks=["SciFact"])
a = time.time()
results = evaluation.run(model, output_folder=f"results/{model_name}", overwrite_results=True)
b = time.time()
hangs at
p = ctx.Process(
    target=SentenceTransformer._encode_multi_process_worker,
    args=(process_id, device_name, self.model, input_queue, output_queue),
    daemon=True,
)
I think you're the expert here - any ideas? @NouamaneTazi
This only affects the latest BEIR, i.e. I think it has something to do with DPRES. Using the below is fine
!pip install -q git+https://github.com/UKPLab/sentence-transformers.git
!pip install -q git+https://github.com/embeddings-benchmark/mteb.git
!pip install beir==1.0.0
Since BEIR doesn't provide an HF dataset for the CQA corpus, we uploaded this one, I think: https://huggingface.co/datasets/mteb/cqadupstack-retrieval/tree/main/data
However, it currently cannot be loaded, possibly because its format differs from the other BEIR datasets (json files instead of jsonl).
Thus, CQADupstack tasks only work with beir <= 1.0.0, using the old data loading, as of right now.
Was the Instructor model tuned per task, or was the default setting used?
We currently skip evaluation when running a new split and a result file of the same name already exists.
It would be better to run the new split and append its results to the existing result file.
In fact, I'm not sure if this is a bug. Below is what I thought the problem was.
Before evaluating for MTOPIntentClassification, mteb will download a module in cache. In my case the module is located at /data2/.cache/huggingface/modules/datasets_modules/datasets/mteb--mtop_intent/7353fdf5b13e9bfd297fbf98bf66e7e0ee626def6321bd9293bbc6ee1d5fae7b
and there is a script called mtop_intent.py:
import json
import datasets
_DESCRIPTION = "MTOP: Multilingual Task-Oriented Semantic Parsing"
_LANGUAGES = ["en", "de", "es", "fr", "hi", "th"]
URL = "" # https://huggingface.co/datasets/mteb/mtop/resolve/main/"
The URL is empty, so the module assumes the files are located in the current working directory, which causes an error.
I changed the URL to
URL = "https://huggingface.co/datasets/mteb/mtop_intent/resolve/main/"
and everything works fine.
Hello,
I've been trying to evaluate several custom models on MTEB and ran into some errors:
2022-11-08 11:12:04.822852 >>> ClimateFEVER
Traceback (most recent call last):
File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/mteb/evaluation/MTEB.py", line 235, in run
results = task.evaluate(model, split, **kwargs)
File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/mteb/abstasks/AbsTaskRetrieval.py", line 93, in evaluate
results = retriever.retrieve(corpus, queries)
File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/beir/retrieval/evaluation.py", line 23, in retrieve
return self.retriever.search(corpus, queries, self.top_k, self.score_function, **kwargs)
File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/beir/retrieval/search/dense/exact_search_multi_gpu.py", line 150, in search
cos_scores_top_k_values, cos_scores_top_k_idx, chunk_ids = metric.compute()
File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/evaluate/module.py", line 433, in compute
self._finalize()
File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/evaluate/module.py", line 390, in _finalize
self.data = Dataset(**reader.read_files([{"filename": f} for f in file_paths]))
File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/arrow_reader.py", line 236, in read_files
pa_table = self._read_files(files, in_memory=in_memory)
File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/arrow_reader.py", line 171, in _read_files
pa_table: Table = self._get_table_from_filename(f_dict, in_memory=in_memory)
File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/arrow_reader.py", line 306, in _get_table_from_filename
table = ArrowReader.read_table(filename, in_memory=in_memory)
File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/arrow_reader.py", line 325, in read_table
return table_cls.from_file(filename)
File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/table.py", line 1036, in from_file
table = _memory_mapped_arrow_table_from_file(filename)
File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/table.py", line 51, in _memory_mapped_arrow_table_from_file
pa_table = opened_stream.read_all()
File "pyarrow/ipc.pxi", line 691, in pyarrow.lib.RecordBatchReader.read_all
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Expected to be able to read 12300328 bytes for message body, got 12300316
(other retrieval tasks have the same issue)
Is there any workaround?
It's inconvenient to have two datasets whose names differ only in case: SciDocs and SCIDOCS. E.g. macOS is case-insensitive by default.
Let's rename SciDocs? I'd propose SciDocsS2S or SciDocsRR instead, but maybe someone has a better idea.
Also do I understand correctly that SciDocs is the same SciDocs as in useb? I.e. it includes all the tasks of Cite, Co-Cite, Co-Read, Co-View, see their paper for details (+ recomm).
cc @loicmagne as I think you added it?
I understand that standard word embeddings (like average_word_embeddings_glove.6B.300d) are downloaded from hugging face, but is there code to evaluate new embeddings? I have a .txt file with vectors trained with the GloVe model that I would like to evaluate.
I see in the documentation that we can write our own encoder model that can be evaluated. But is there a way to only input a .txt file of the word embeddings for evaluation?
If there is no code to support a .txt file input, then for the encoder, are the input sentences already tokenized?
cc @NouamaneTazi ?
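There is no built-in .txt loader as far as I know, but a small wrapper along these lines should be enough; this is only a sketch assuming the standard GloVe text format (one token per line followed by its vector), and the whitespace tokenization is a simplification (MTEB passes raw strings to encode(), so tokenization is up to the model):

import numpy as np

class TxtWordEmbeddingModel:
    """Averages pre-trained word vectors loaded from a GloVe-style .txt file."""

    def __init__(self, path):
        self.vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                self.vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        self.dim = len(next(iter(self.vectors.values())))

    def encode(self, sentences, batch_size=32, **kwargs):
        # Lowercase + whitespace split; unknown tokens are skipped.
        embeddings = []
        for sentence in sentences:
            vecs = [self.vectors[t] for t in sentence.lower().split() if t in self.vectors]
            embeddings.append(np.mean(vecs, axis=0) if vecs else np.zeros(self.dim))
        return np.stack(embeddings)

# model = TxtWordEmbeddingModel("glove.6B.300d.txt")
# MTEB(tasks=["Banking77Classification"]).run(model)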
The MindSmallReranking dataset contains 2,362,514 queries, 107,968 positive docs, and 2,550,123 negative docs.
Currently, RerankingEvaluator.compute_metrics_batched() just gathers all texts together and encodes them, which requires a lot of memory / GPU memory. (I got a CUDA OOM on a 32GB V100.)
I made minor modifications to the code to implement chunked computation, reducing memory usage.
If this change is acceptable, I would be glad to make a PR.
Thanks.
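A minimal sketch of the kind of chunked encoding meant here; the helper and chunk size are illustrative, not the actual change proposed:

import numpy as np

def encode_in_chunks(model, texts, chunk_size=50_000, batch_size=128):
    """Encode a long list of texts chunk by chunk to keep peak memory bounded."""
    chunks = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        # Each chunk is encoded independently; only its embeddings are kept around.
        chunks.append(np.asarray(model.encode(chunk, batch_size=batch_size)))
    return np.concatenate(chunks, axis=0)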
It's referenced in https://arxiv.org/pdf/2212.09741.pdf
STS22 scores should not be negative https://competitions.codalab.org/competitions/33835#results
evaluation = MTEB(tasks=["AmazonCounterfactualClassification"], task_langs=["zh"])
evaluation.run(model)
Task: AmazonCounterfactualClassification, split: validation, language: en. Running...
^CTraceback (most recent call last):
File "", line 1, in
The following code should give the same results
import logging
from mteb import MTEB
from sentence_transformers import SentenceTransformer
logging.basicConfig(level=logging.INFO)
model_name = "average_word_embeddings_komninos"
model = SentenceTransformer(model_name)
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder=None)
It would be nice to write a test for that as well in the tests folder.
I'm getting the results below for:
from mteb import MTEB
from mteb.abstasks.AbsTaskClustering import AbsTaskClustering
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=["BUCC"])
evaluation.run(model)
{
  "dataset_version": null,
  "mteb_version": "0.0.2",
  "test": {
    "de-en": {
      "accuracy": 0.0017745302713987473,
      "f1": 0.0017745302713987473,
      "precision": 0.0017745302713987473,
      "recall": 0.0017745302713987473
    },
    "evaluation_time": 456.59,
    "fr-en": {
      "accuracy": 0.0,
      "f1": 0.0,
      "precision": 0.0,
      "recall": 0.0
    },
    "ru-en": {
      "accuracy": 6.927606511950121e-05,
      "f1": 6.927606511950121e-05,
      "precision": 6.927606511950121e-05,
      "recall": 6.927606511950121e-05
    },
    "zh-en": {
      "accuracy": 0.0,
      "f1": 0.0,
      "precision": 0.0,
      "recall": 0.0
    }
  }
}
Seems too low - I think there's a bug
It looks like the multi-gpu support for the BEIR benchmarks is still disabled as of https://github.com/embeddings-benchmark/mteb/releases/tag/1.0.1.
What is the current status of it? Is it actively developed in BEIR repo?
Btw, we successfully ran the multi-GPU utils in the BEIR repository with downgraded dependencies, but would like to switch over to MTEB to have a broader benchmark collection.
Sometimes we would like to override already computed evaluations. It would be cool to have an override_results flag that would handle that.
For BUCC fr-en we currently search the English texts given French texts.
a) Most use cases are probably the inverse
b) We should generally support both ways / automatically run both (e.g. like it's done in https://arxiv.org/pdf/2007.01852.pdf)
What do you think @NouamaneTazi ?
It looks like the TwitterSemEval2015 test data combines the train, dev, and test data from the original task. Was this intentional? My assumption is that some of the models would have been trained on that data.
An important factor in choosing embeddings is the speed of embedding.
I suggest adding a "tab" in the evaluation called "Speed", represented in sentences/sec for example (can also be tokens/sec).
This is a very useful feature of the SBERT site for example:
https://www.sbert.net/docs/pretrained-models/msmarco-v3.html
and efficiency as a parameter is already mentioned in your paper.
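For reference, a rough sketch of how sentences/sec could be measured for a model; the model name and sentence count are arbitrary placeholders:

import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model
sentences = ["A quick benchmark sentence."] * 10_000

start = time.time()
model.encode(sentences, batch_size=256, show_progress_bar=False)
elapsed = time.time() - start
print(f"{len(sentences) / elapsed:.1f} sentences/sec")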
It'd be nice to also have information about the hardware used in the results file in addition to the evaluation time if this is easy to get!
If they're the same, one should be removed; if not, it should be fixed.
I think we should have some form of versioning.
E.g. for each task have an additional field in the json results file called "version" or "revision". We can set it to 0 for all tasks for now or to e.g. the commit string of the dataset on the Hub.
Currently we use the following for classification:
As the test set embeddings will be the same across repetitions, we can compute them once and just feed them to the LR classifier. This will make the 10-times repeated evaluation much faster.
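A minimal sketch of the proposed optimization; the function and names are illustrative, not the actual MTEB evaluator code (which is elided above):

import numpy as np
from sklearn.linear_model import LogisticRegression

def repeated_classification(model, train_samples, x_test_texts, y_test):
    """Repeated train-sample evaluation with the test set encoded only once.

    `train_samples` is a list of (texts, labels) pairs, one per repetition.
    """
    x_test = np.asarray(model.encode(x_test_texts))  # encode the test set a single time
    scores = []
    for texts, labels in train_samples:
        x_train = np.asarray(model.encode(texts))  # only the small train sample changes
        clf = LogisticRegression(max_iter=100)
        clf.fit(x_train, labels)
        scores.append(clf.score(x_test, y_test))
    return float(np.mean(scores))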
It's a great dataset for bitext mining! Any help welcome 🤗
Edit: Done via #218
Hi all,
thank you for sharing this awesome repo!
I am having experiments on classification tasks.
I am wondering if the inference results (e.g., the predicted class for each test sentence) and the evaluation results (e.g., whether the predicted class for each test sentence is correct) are available via some command?
Best regards,
Jihyuk
Is the validation set used for RedditClustering and StackOverflowDupQuestions? Related to #83 and #84.
2022-12-16 05:26:27.662809 >>> RedditClustering
Traceback (most recent call last):
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/mteb/evaluation/MTEB.py", line 235, in run
results = task.evaluate(model, split, **kwargs)
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/mteb/abstasks/AbsTaskClustering.py", line 17, in evaluate
for cluster_set in tqdm.tqdm(self.dataset[split], desc="Clustering"):
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/datasets/dataset_dict.py", line 57, in getitem
return super().getitem(k)
KeyError: 'validation'
2022-12-16 05:26:39.846393 >>> StackOverflowDupQuestions
Traceback (most recent call last):
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/mteb/evaluation/MTEB.py", line 235, in run
results = task.evaluate(model, split, **kwargs)
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/mteb/abstasks/AbsTaskReranking.py", line 21, in evaluate
data_split = self.dataset[split]
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/datasets/dataset_dict.py", line 57, in getitem
return super().getitem(k)
KeyError: 'validation'
Currently, when a task name is wrong, nothing happens upon evaluation.run.
I think it'd be nice to raise a warning that a task wasn't found.
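A sketch of the proposed behaviour; the helper and the way task names are enumerated are stand-ins, not MTEB's internal API:

import logging

logger = logging.getLogger(__name__)

def warn_about_unknown_tasks(requested, available_task_names):
    """Warn for every requested task name that does not match a known task."""
    unknown = set(requested) - set(available_task_names)
    for name in sorted(unknown):
        logger.warning("Task '%s' not found and will be skipped.", name)

# warn_about_unknown_tasks(["Banking77Classification", "TypoTaskName"], ["Banking77Classification"])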
We could add the prompt retrieval benchmark: https://arxiv.org/abs/2209.01975
Hello @Muennighoff, I encountered the following issue when loading the ArxivClusteringP2P dataset:
repro
from mteb import MTEB


def test_loading_data():
    eval = MTEB(tasks=["ArxivClusteringP2P"])
    eval.load_tasks_data()
    return


if __name__ == "__main__":
    test_loading_data()
output:
Generating test split: 23 examples [00:04, 6.38 examples/s]Failed to read file '/root/.cache/huggingface/datasets/downloads/extracted/2368c5e45f666e09c88b163b1db73ad115ce53e3954755e8936da145b036ae4b' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Missing a closing quotation mark in string. in row 0
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/datasets/packaged_modules/json/json.py", line 153, in _generate_tables
dataset = json.load(f)
File "/usr/lib/python3.8/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.8/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 25447588)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1817, in _prepare_split_single
for _, table in generator:
File "/usr/local/lib/python3.8/dist-packages/datasets/packaged_modules/json/json.py", line 156, in _generate_tables
raise e
File "/usr/local/lib/python3.8/dist-packages/datasets/packaged_modules/json/json.py", line 132, in _generate_tables
pa_table = paj.read_json(
File "pyarrow/_json.pyx", line 259, in pyarrow._json.read_json
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Missing a closing quotation mark in string. in row 0
I would like to add the Universal Sentence Encoder family of models to the automated evaluation.
It is relatively simple to evaluate them (thanks for making it straightforward), but it is not clear how to create a pull request to add the model to the automated evaluation on the website. Please advise.
# !pip install tensorflow_text
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer  # import registers the SentencePiece op
import tensorflow as tf

embedder = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")


class USE:
    def encode(self, sentences, batch_size=32, **kwargs):
        embeddings = []
        for i in range(0, len(sentences), batch_size):
            batch_sentences = sentences[i:i + batch_size]
            batch_embeddings = embedder(batch_sentences)
            embeddings.extend(batch_embeddings)
        return embeddings


model = USE()
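For reference, running it against the benchmark would then follow the usual pattern from the README (the task below is just an example, and model is the USE() instance defined above):

from mteb import MTEB

evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/use-multilingual-large")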
Hi all,
I would like to compare varying configurations for AbsTaskClassification.
For example, I am wondering about evaluation results with method=kNN.
But I am not sure how I can change those parameters in Python scripts.
Could you help me with this?
Specifically, I would like to compare method=kNN (as described in the paper; 3.2 Tasks and evaluation - Classification) and method=logReg (which is the default value for the method param in the code).
Best regards,
Jihyuk
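If the kwargs given to evaluation.run() are forwarded down to the classification evaluator (my reading of the code, not something I have verified), switching methods could look roughly like this:

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=["Banking77Classification"])
# "method" is the parameter mentioned above; "kNN" and "logReg" are its two values.
evaluation.run(model, method="kNN", output_folder=None)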
When running clustering tasks, I keep seeing this warning:
FutureWarning: The default value of `n_init` will change from 3 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
To keep the behavior of MTEB stable across versions of sklearn, you should set n_init to 3 explicitly. If someone happens to run this with an sklearn version >= 1.4, they would start getting different results. If you want, I can make a PR.
I'm on scikit-learn 1.2.2
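A minimal sketch of the proposed fix, assuming the clustering evaluator uses sklearn's MiniBatchKMeans; the cluster count below is a placeholder:

from sklearn.cluster import MiniBatchKMeans

num_clusters = 10  # example value; in MTEB this comes from the task's label set

# Pinning n_init keeps behaviour identical before and after sklearn 1.4,
# where the default changes from 3 to "auto".
clustering_model = MiniBatchKMeans(n_clusters=num_clusters, n_init=3)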
It seems like the leaderboard on Hugging Face is down (https://huggingface.co/spaces/mteb/leaderboard); it just says "Preparing Space" until it times out.
Some other people on Huggingface having the same issue:
https://huggingface.co/spaces/mteb/leaderboard/discussions/7
Is there a static version of the leaderboard, or another way of accessing the data?
For MIND we only compute MAP & MRR, while their leaderboard's main scores are AUC & NDCG & MRR.
We should also compute AUC & NDCG.
Would the maintainers be interested in the addition of a code retrieval task (CodeSearchNet, which uses text queries to retrieve code documents), either as a new code retrieval type or added into the existing retrieval category?
Can look into this once I have some bandwidth
Hi! Thanks for this easy-to-use repo!
However, I'm getting this error when running the example script https://github.com/embeddings-benchmark/mtebscripts/blob/main/run_array_simcse.py on retrieval benchmarks like QuoraRetrieval.
How do I evaluate on retrieval tasks when using my own model with a wrapper?
It'd be great if we could figure out using multiple gpus on tasks other than BEIR.
E.g. RedditClusteringP2P takes >20h for a 5.8B model with embeddings of 4096 dimensions.
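One possible workaround while this isn't built in: wrap the model so that encode() goes through sentence-transformers' multi-process pool (one worker per visible GPU). This is a sketch on the caller's side, not something MTEB does for you:

from sentence_transformers import SentenceTransformer

class MultiGPUEncoder:
    """Encodes with one worker per visible GPU via sentence-transformers' pool API."""

    def __init__(self, model_name):
        self.model = SentenceTransformer(model_name)
        self.pool = self.model.start_multi_process_pool()  # one process per CUDA device

    def encode(self, sentences, batch_size=64, **kwargs):
        return self.model.encode_multi_process(sentences, self.pool, batch_size=batch_size)

# model = MultiGPUEncoder("sentence-transformers/all-MiniLM-L6-v2")
# MTEB(tasks=["RedditClusteringP2P"]).run(model)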