
mera's Introduction

MERA

MERA (Multimodal Evaluation for Russian-language Architectures) is a new open benchmark for evaluating foundation models in Russian.

About MERA

The MERA benchmark brings together industry and academic players in one place to study the capabilities of foundation models, draw attention to AI problems, foster collaboration within the Russian Federation and internationally, and create an independent, unified system for measuring all current models. This repository is a customized version of the original Language Model Evaluation Harness (LM-Harness v0.3.0).

Our contributions to this project are:

  • Instruction-based tasks available on the 🤗 HuggingFace dataset card.
  • A customized version of the LM-Harness evaluation code for models (v0.3.0).
  • The benchmark website with the leaderboard and the scoring submission system.
  • Baselines for open models and the Human Benchmark.

The MERA benchmark includes 21 text tasks (17 base tasks and 4 diagnostic tasks); see the table below for the complete list.

| Name         | Task Name    | Task Type                            | Test Size | N-shots | Metrics         |
|--------------|--------------|--------------------------------------|-----------|---------|-----------------|
| MathLogicQA  | mathlogicqa  | Math, Logic                          | 1143      | 5       | Acc             |
| MultiQ       | multiq       | Reasoning                            | 900       | 0       | EM / F1         |
| PARus        | parus        | Common Sense                         | 500       | 0       | Acc             |
| RCB          | rcb          | NLI                                  | 438       | 0       | Acc / F1_macro  |
| ruModAr      | rumodar      | Math, Logic                          | 6000      | 0       | Acc             |
| ruMultiAr    | rumultiar    | Math                                 | 1024      | 5       | Acc             |
| ruOpenBookQA | ruopenbookqa | World Knowledge                      | 400       | 5       | Acc / F1_macro  |
| ruTiE        | rutie        | Reasoning, Dialogue Context, Memory  | 430       | 0       | Acc             |
| ruWorldTree  | ruworldtree  | World Knowledge                      | 525       | 5       | Acc / F1_macro  |
| RWSD         | rwsd         | Reasoning                            | 260       | 0       | Acc             |
| SimpleAr     | simplear     | Math                                 | 1000      | 5       | Acc             |
| BPS          | bps          | Code, Math                           | 1000      | 2       | Acc             |
| CheGeKa      | chegeka      | World Knowledge                      | 416       | 4       | EM / F1         |
| LCS          | lcs          | Code, Math                           | 500       | 2       | Acc             |
| ruHumanEval  | ruhumaneval  | Code                                 | 164       | 0       | Pass@k          |
| ruMMLU       | rummlu       | Reasoning                            | 961       | 5       | Acc             |
| USE          | use          | Exam                                 | 900       | 0       | Grade_norm      |
| ruDetox      | rudetox      | Ethics                               | 800       | 0       | J(STA, SIM, FL) |
| ruEthics     | ruethics     | Ethics                               | 1935      | 0       | 5 MCC           |
| ruHateSpeech | ruhatespeech | Ethics                               | 265       | 0       | Acc             |
| ruHHH        | ruhhh        | Ethics                               | 178       | 0       | Acc             |

Our aim is to evaluate all models:

  • in the same scenarios;
  • using the same metrics;
  • with the same adaptation strategy (e.g., prompting);
  • so that controlled and clear comparisons are possible.

MERA is a collaborative project created by industry and academia, with the support of the companies that build foundation models, to ensure fair and transparent leaderboards for model evaluation.

We express our gratitude to our team and partners:

SberDevices, Sber AI, Yandex, Skoltech AI, MTS AI, NRU HSE, the Russian Academy of Sciences, and others.

Powered by Alliance AI

Contents

The repository has the following structure:

  • examples — examples of loading and using the data.
  • humanbenchmarks — materials and code for the human evaluation.
  • modules — examples of the scoring scripts that the website uses to score your submission.
  • lm-evaluation-harness — a framework for few-shot evaluation of language models.

The submission process is as follows:

  • view the datasets via the HuggingFace preview or by following the prepared instructions;
  • clone the MERA benchmark repository;
  • produce the submission files using the shell script and the provided customized lm-harness code (the model itself is not required for submission and evaluation);
  • run your model on all the datasets with the lm-harness code; the result is a ZIP archive ready for submission;
  • register on the website;
  • upload the submission files (ZIP) via the platform interface for automatic assessment.

Note that the evaluation result is then displayed in the user's account and kept private. Those who want to make their submission results public can use the "Publish" function. Once the submission passes validation, the model's overall score is shown publicly. The generation parameters, prompts and few-shot/zero-shot settings are fixed. You can vary them for your own purposes, but if you want to submit results to the public leaderboard, make sure these parameters are unchanged and please attach the logs. We have to be sure that the evaluation scenarios are the same and reproducible.

We provide the sample submission for you to check the format.
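Before uploading, you can also sanity-check the archive locally, for example by verifying that every per-task JSON file inside it parses. This is only a sketch, and it assumes the archive contains the per-task <task_name>_result.json files produced by the lm-harness run:

import json
import zipfile

# Hypothetical path to your submission archive; adjust to your own file.
SUBMISSION = "mera_submission.zip"

with zipfile.ZipFile(SUBMISSION) as zf:
    for name in zf.namelist():
        if not name.endswith(".json"):
            continue
        try:
            json.loads(zf.read(name))  # raises on malformed JSON
            print(f"OK      {name}")
        except json.JSONDecodeError as err:
            print(f"BROKEN  {name}: {err}")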

The whole MERA evaluation process is shown in the figure below:

[Figure: evaluation setup]


📌 This is the first, text-only version of the benchmark. We plan to expand it in the future with new tasks and multimodality.

Feel free to ask any questions about our work by email at [email protected]. If you have ideas or new tasks, please suggest them; this is important to us! If you spot bugs or know how to improve the code, please propose fixes via pull requests and issues in this official GitHub repository 🤗. We will be glad to receive feedback in any form.

Cite as

@misc{fenogenova2024mera,
    title={{MERA}: A Comprehensive {LLM} Evaluation in {Russian}},
    author={Alena Fenogenova and Artem Chervyakov and Nikita Martynov and Anastasia Kozlova and Maria Tikhonova and Albina Akhmetgareeva and Anton Emelyanov and Denis Shevelev and Pavel Lebedev and Leonid Sinev and Ulyana Isaeva and Katerina Kolomeytseva and Daniil Moskovskiy and Elizaveta Goncharova and Nikita Savushkin and Polina Mikhailova and Denis Dimitrov and Alexander Panchenko and Sergei Markov},
    year={2024},
    eprint={2401.04531},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2401.04531}
}


mera's Issues

empty value rummlu

Hello!

The metric for the rummlu task is not being computed.

mmv@mmv:~/dev/mera/MERA/lm-evaluation-harness$ python3 ./main.py --model hf-causal-experimental --model_args pretrained=mistralai/Mistral-7B-v0.1,dtype=auto,use_accelerate=True,max_memory_per_gpu=20GB,max_length=4096 --output_base_path="$PWD/mera_results/Mistral-7B-v0.1_defaults" --batch_size=4 --write_out --tasks rummlu --num_fewshot=5 --output_path="$PWD/mera_results/Mistral-7B-v0.1_defaults/rummlu_result.json" --device cuda --limit 50

Selected Tasks: ['rummlu']

With the --inference flag:

Task    Version  Metric  Value  Stderr
rummlu  0        metric  0      ± 0

Without the --inference flag:
Traceback (most recent call last):
  File "/home/mmv/dev/mera/MERA/lm-evaluation-harness/./main.py", line 126, in <module>
    main()
  File "/home/mmv/dev/mera/MERA/lm-evaluation-harness/./main.py", line 84, in main
    results = evaluator.simple_evaluate(
  File "/home/mmv/dev/mera/MERA/lm-evaluation-harness/lm_eval/utils.py", line 238, in _wrapper
    return fn(*args, **kwargs)
  File "/home/mmv/dev/mera/MERA/lm-evaluation-harness/lm_eval/evaluator.py", line 197, in simple_evaluate
    results = evaluate(
  File "/home/mmv/dev/mera/MERA/lm-evaluation-harness/lm_eval/utils.py", line 238, in _wrapper
    return fn(*args, **kwargs)
  File "/home/mmv/dev/mera/MERA/lm-evaluation-harness/lm_eval/evaluator.py", line 980, in evaluate
    results[task_name][metric + "_stderr"] = stderr(items)
  File "/home/mmv/dev/mera/MERA/lm-evaluation-harness/lm_eval/metrics.py", line 25, in mean_stderr
    return sample_stddev(arr) / math.sqrt(len(arr))
  File "/home/mmv/dev/mera/MERA/lm-evaluation-harness/lm_eval/metrics.py", line 21, in sample_stddev
    return math.sqrt(sum([(x - mu) ** 2 for x in arr]) / (len(arr) - 1))
ZeroDivisionError: float division by zero
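Judging from the traceback, sample_stddev divides by len(arr) - 1, which raises ZeroDivisionError when only a single scored item reaches the metric. A guarded version (a sketch, not the project's actual fix) would look like this:

import math

def sample_stddev(arr):
    # Unbiased sample standard deviation; returns 0.0 for fewer than two
    # items instead of dividing by zero.
    if len(arr) < 2:
        return 0.0
    mu = sum(arr) / len(arr)
    return math.sqrt(sum((x - mu) ** 2 for x in arr) / (len(arr) - 1))

def mean_stderr(arr):
    # Standard error of the mean; 0.0 for empty input.
    if not arr:
        return 0.0
    return sample_stddev(arr) / math.sqrt(len(arr))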

0.4.0 lm-evaluation-harness

Hi!

Your benchmarks are functioning well with version 0.3.0 of lm-evaluation-harness. Are there any plans to update and support version 0.4.0?

Submission errors on mera.a-ai.ru

Hello.

When submitting a solution on mera.a-ai.ru, an "Error" appears during metric computation. Is there any way to view the logs and find out what caused the error? I suspect that one JSON file (on my side) for one of the benchmarks inside the ZIP archive may be broken or incorrectly formed. Is there a way to see exactly what the error is?

  1. Are there plans to add log viewing to the website?
  2. Are there plans to allow scoring models on individual benchmarks, for example by uploading a single JSON for a specific task?

Thank you.

Large models that do not fit on one GPU are not parallelized across several

Loading checkpoint shards: 0%| | 0/19 [00:00<?, ?it/s]
Loading checkpoint shards: 5%|▌ | 1/19 [00:01<00:24, 1.37s/it]
Loading checkpoint shards: 11%|█ | 2/19 [00:03<00:30, 1.80s/it]
Loading checkpoint shards: 16%|█▌ | 3/19 [00:06<00:37, 2.35s/it]
Loading checkpoint shards: 21%|██ | 4/19 [00:11<00:51, 3.41s/it]
Loading checkpoint shards: 26%|██▋ | 5/19 [00:17<00:58, 4.19s/it]
Loading checkpoint shards: 32%|███▏ | 6/19 [00:21<00:57, 4.38s/it]
Loading checkpoint shards: 37%|███▋ | 7/19 [00:27<00:58, 4.87s/it]
Loading checkpoint shards: 42%|████▏ | 8/19 [00:34<01:00, 5.51s/it]
Loading checkpoint shards: 47%|████▋ | 9/19 [00:38<00:51, 5.12s/it]
Loading checkpoint shards: 53%|█████▎ | 10/19 [00:44<00:47, 5.29s/it]
Loading checkpoint shards: 58%|█████▊ | 11/19 [00:51<00:47, 5.89s/it]
Loading checkpoint shards: 63%|██████▎ | 12/19 [00:54<00:34, 4.95s/it]
slurmstepd: error: *** JOB 2971874 ON sc34 CANCELLED AT 2024-05-12T23:49:39 ***
slurmstepd: error: Detected 1 oom_kill event in StepId=2971874.batch. Some of the step tasks have been OOM Killed.

The situation is the same with 1, 2 or 3 A100 cards. Model: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
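For reference, outside the harness a checkpoint of this size can be sharded across all visible GPUs with plain transformers/accelerate. Whether the harness's use_accelerate and max_memory_per_gpu options end up doing the equivalent for this model is something to verify, so treat this only as a sketch; the memory limits are hypothetical values, not MERA settings:

import torch
from transformers import AutoModelForCausalLM

# Requires accelerate. device_map="auto" spreads layers across visible GPUs;
# the per-GPU caps below are hypothetical values for 80GB A100s.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "70GiB", 1: "70GiB", 2: "70GiB"},
)
print(model.hf_device_map)  # shows which layers landed on which device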

Benchmark log values

We are running the MERA benchmark on various models, and in every <task_name>_result.json file the fields are

"metric": 0.0,
"metric_stderr": 0.0

Is a submission with these metric values valid? Are the metrics computed after submission?

Also, in some tasks (e.g., chegeka) the logits are equal to zero. Is that a peculiarity of those tasks?

[Feature Request] Support for OpenAI ChatCompletion models

  • Supported in the original lm-evaluation-harness.
  • Would make it possible to test an unlimited pool of models through tools such as vllm / llama.cpp-server / text-generation-webui / etc.
  • Prompt format is configured on the server side.
  • The inference machine and the evaluation machine can be separated.
  • Proprietary models with an OpenAI-like API (e.g., mistral-medium) could be tested (see the sketch below).
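As a rough illustration of what such a backend would talk to, an OpenAI-compatible endpoint (for example one exposed by vllm or llama.cpp-server) can be queried like this; the URL, API key and model name are placeholders, and none of this is part of the MERA code:

from openai import OpenAI

# Placeholder endpoint and model name; any OpenAI-compatible server works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="my-local-model",
    messages=[{"role": "user", "content": "Почему трава зеленая?"}],
    temperature=0.0,
    max_tokens=256,
)
print(response.choices[0].message.content)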

tokenizer does not have a padding token

Models that do not define a pad token (the whole Mistral series and their derivatives) fail on the rudetox, use, rumodar, multiq, rumultiar, simplear and chegeka benchmarks with the error:

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).

I ran it like this:

CUDA_VISIBLE_DEVICES=1 python main.py --model hf-causal-experimental --model_args pretrained=mistralai/Mistral-7B-v0.1,dtype=auto,max_length=11500 \
--device cuda --output_base_path="$PWD/mera_results/Mistral-7B-v0.1_defaults" --batch_size=1 \
--inference --write_out --no_cache --tasks rudetox \
--output_path="$PWD/mera_results/Mistral-7B-v0.1_defaults/rudetox_result.json"

Trace

Traceback (most recent call last):
  File ".../MERA/lm-evaluation-harness/main.py", line 141, in <module>
    main()
  File ".../MERA/lm-evaluation-harness/main.py", line 98, in main
    results = evaluator.simple_evaluate(
  File ".../MERA/lm-evaluation-harness/lm_eval/utils.py", line 238, in _wrapper
    return fn(*args, **kwargs)
  File ".../MERA/lm-evaluation-harness/lm_eval/evaluator.py", line 145, in simple_evaluate
    rudetox_results = evaluate(
  File ".../MERA/lm-evaluation-harness/lm_eval/utils.py", line 238, in _wrapper
    return fn(*args, **kwargs)
  File ".../MERA/lm-evaluation-harness/lm_eval/evaluator.py", line 1033, in evaluate
    resps = getattr(lm, reqtype)([req.args for req in reqs])
  File ".../MERA/lm-evaluation-harness/lm_eval/models/huggingface.py", line 504, in greedy_until
    token_context = self.tok_encode_batch(context)
  File ".../MERA/lm-evaluation-harness/lm_eval/models/huggingface.py", line 431, in tok_encode_batch
    return self.tokenizer(
  File ".../MERA/lm-evaluation-harness/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2803, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File ".../MERA/lm-evaluation-harness/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2889, in _call_one
    return self.batch_encode_plus(
  File ".../MERA/lm-evaluation-harness/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3071, in batch_encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
  File ".../MERA/lm-evaluation-harness/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2708, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

Since these benchmarks already appear on the leaderboard, does that mean this has somehow been fixed already?
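For reference, the usual workaround is the one the error message itself suggests: reuse the EOS token for padding before encoding. A minimal sketch with a stock Hugging Face tokenizer (not necessarily how the MERA harness handles it internally):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Mistral-family tokenizers ship without a pad token; reusing EOS for padding
# makes batched encoding with padding=True work.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["первый пример", "второй, более длинный пример"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)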

How to add prompt formatting?

How can prompt formatting be added for a model?

Some models are trained with a specific prompt template. For example, take the model Open-Orca/Mistral-7B-OpenOrca.

This model expects a prompt in the following format:

<|im_start|>system
You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!
<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
I am doing well!<|im_end|>
<|im_start|>user
Please tell me about how mistral winds have attracted super-orcas.<|im_end|>
<|im_start|>assistant

If the model is given a prompt in a different format, the answer differs substantially. Below is a code snippet to reproduce this.

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

device = "cuda"
model = AutoModelForCausalLM.from_pretrained(
    "Open-Orca/Mistral-7B-OpenOrca",
    torch_dtype=torch.float16,
    device_map={"": 0},
)
tokenizer = AutoTokenizer.from_pretrained("Open-Orca/Mistral-7B-OpenOrca")

generation_config = GenerationConfig(
    max_length=256,
    temperature=1.1,
    top_p=0.95,
    repetition_penalty=1.0,
    # do_sample=True,
    use_cache=True,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)


def generate_original_orca(instruction):
    chat = [
        {"role": "user", "content": instruction},
    ]
    chat = tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=True
    )
    print(chat)

    inputs = tokenizer(chat, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, generation_config=generation_config)[0]
    outputs = outputs[len(inputs["input_ids"][0]) :]
    text = tokenizer.decode(outputs)
    text = text.replace("<|im_end|>", "").strip()
    return text


def generate_original_orca_no_template(instruction):
    inputs = tokenizer(instruction, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, generation_config=generation_config)[0]
    outputs = outputs[len(inputs["input_ids"]) :]
    text = tokenizer.decode(outputs)
    text = text.strip()
    return text


print("CHAT TEMPLATE")
print(generate_original_orca("Почему трава зеленая?"))
print("NO TEMPLATE")
print(generate_original_orca_no_template("Почему трава зеленая?"))
Output:

CHAT TEMPLATE
<|im_start|>user
Почему трава зеленая?<|im_end|>
<|im_start|>assistant

Зеленой окраской у растений обуславливается наличие хлорофилла, который является основным пигментом, отвечающим за процесс фотосинтеза. Фотосинтез - это процесс, благодаря которому растения превращают свет, углекислый газ и воду в углеводы и кислород. Хлорофилл поглощает свет, в основном в зелёном диапазоне, и превращает его в энергию, необходимую для синтеза углеводов. Таким образом, зеленая окраска растений обусловлена наличием хлорофилла и процессом фотосинтеза, который обеспечивает их рост и развитие.
NO TEMPLATE
Почему трава зеленая?

The question "Почему трава зеленая?" translates to "Why is the grass green?" in English. This question is often asked by children who are curious about the colors they see in nature.

The color green is associated with grass, leaves, and other plants because of the presence of chlorophyll, a pigment that is responsible for the process of photosynthesis. Photosynthesis is the process by which plants, algae, and some bacteria convert light energy into chemical energy in the form of glucose. This process is essential for the growth and survival of plants, as it allows them to produce their own food using sunlight, water, and carbon dioxide.

Chlorophyll is a green pigment because it absorbs light most efficiently in the blue and red parts of the spectrum, while reflecting green light. This is why plants appear green to our eyes. The other colors we see in plants, such as red, orange, and yellow, are due to the presence of other pigments called carotenoids and anthocyanins. These pigments are responsible for the vibrant colors we see in
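For what it's worth, the wrapping demonstrated above can be factored into a small helper that turns a raw benchmark prompt into a chat-formatted one. Where such a hook would belong in the customized lm-harness (for example before tok_encode_batch in greedy_until) is an assumption, not a documented extension point:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Open-Orca/Mistral-7B-OpenOrca")

def to_chat_prompt(task_prompt: str) -> str:
    # Wrap a raw benchmark prompt in the model's own chat template.
    chat = [{"role": "user", "content": task_prompt}]
    return tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=True
    )

print(to_chat_prompt("Почему трава зеленая?"))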

Scoring GGUF models

Good day.

Are there any examples of running the scoring with quantized models in GGUF format?
Where should one run scoring for models that are not a transformers.PreTrainedModel or are not hosted on Hugging Face, for example a custom model saved on disk?

Thank you.

How to score a model without a loglikelihood method?

Hello!

I would like to compute the benchmark metrics for a model that is available only via an API (e.g., ChatGPT, Bard, etc.). How can such a model be scored if the API does not return logprobs?

As far as I understand, for scoring we need to be able to build a dictionary like this:

prompt_0:"Задание содержит вопрос по теме Математика и 4 варианта ответа A, B, C, D, из которых только один правильный. Выберите букву правильного ответа: Чему равен корень из 144? A 14 B 12 C 4 D 44 Ответ: A"
prompt_1:"Задание содержит вопрос по теме Математика и 4 варианта ответа A, B, C, D, из которых только один правильный. Выберите букву правильного ответа: Чему равен корень из 144? A 14 B 12 C 4 D 44 Ответ: B"
prompt_2:"Задание содержит вопрос по теме Математика и 4 варианта ответа A, B, C, D, из которых только один правильный. Выберите букву правильного ответа: Чему равен корень из 144? A 14 B 12 C 4 D 44 Ответ: C"
prompt_3:"Задание содержит вопрос по теме Математика и 4 варианта ответа A, B, C, D, из которых только один правильный. Выберите букву правильного ответа: Чему равен корень из 144? A 14 B 12 C 4 D 44 Ответ: D"
logit_0:-0.9664535356921388
logit_1:-0.4407325991753527
logit_2:-0.007491470058587191
logit_3:-0.9109759624491242

Is it possible to score models using only the generated text rather than the model's logits?
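One common workaround for API-only models (not something the current MERA code does out of the box) is to ask the model to generate the answer letter directly and score the parsed letter instead of comparing log-probabilities, roughly like this:

import re

def score_generated_choice(generated_text: str, gold_letter: str) -> float:
    # Return 1.0 if the first standalone answer letter (A-D) found in the
    # generation matches the gold letter, else 0.0. A sketch, not MERA's code.
    match = re.search(r"\b([A-D])\b", generated_text)
    prediction = match.group(1) if match else None
    return float(prediction == gold_letter)

# The model is prompted once with the question and all options and asked to
# answer with a single letter, instead of being scored on per-option logits.
print(score_generated_choice("Ответ: B", "B"))  # prints 1.0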

How to benchmark a closed model that has no loglikelihood method?

Hello.
We want to score closed models, but the Anthropic template does not work (there is no loglikelihood method and no tokenizer).
The framework you build on has a generation method for closed models, generate_until (no logprobs). Does it somehow need to be wired into the current MERA evaluation code?
