
llm-rgb's Introduction

LLM Reasoning and Generation Benchmark

This repository contains a collection of detailed test cases (prompts) designed to evaluate the reasoning and generation capabilities of Large Language Models (LLMs) in complex scenarios. Note that this benchmark is not intended to be a comprehensive test of LLMs. The project began as an internal effort at babel.cloud to assess how well LLMs understand context and comply with instructions.

Complex scenarios present three main challenges compared to chat or simple generation tasks:

  1. Context Length: A single prompt may contain more than 8000 tokens (approximately 20K characters).
  2. Reasoning Depth: The generation of an answer may require multi-step reasoning.
  3. Instruction Compliance: The LLM may need to generate a response in a specific format, rather than in natural language.

Each test case is a generation task for an LLM, without involving multi-turn conversations. The complexity of each test case is assessed based on the following dimensions:

Context Length Difficulty: 1 - 3

The value is 1 if the prompt contains 2,000 characters or fewer, 2 if it contains between 2,001 and 5,000 characters (inclusive), and 3 if it contains more than 5,000. A model's actual performance in this dimension depends on the result of each task together with that task's context length difficulty; it is not accurate to rate a model's ability across context lengths solely by the maximum context length it can handle.

Reasoning Depth Difficulty: 1 - 4

The value is 1 if the answer can be inferred directly from the context, as in a knowledge-base lookup. The value is 2 if the answer requires reasoning, for example: "Who is considered the father of the iPhone, and what is the last digit of his birth year?". The value is 4 if the answer requires reasoning over the provided context, such as writing a program in a syntax defined only in that context.

Instruction Compliance Difficulty: 1 - 3

The value is 1 if the expected response is natural language with no special requirements, 2 if it must follow a specific style (e.g., "YES or NO" only, or a shell command only), and 3 if it must follow a structured format such as JSON or YAML.

The difficulty of each test case (Dn) is the sum of its three difficulty values. Each test case includes a set of assertions that evaluate the LLM's output; the assertion result (Rn) is a decimal in [0, 1]. The final score of test case n is Sn = Rn x Dn. The total score for each LLM is the sum of all test case scores (S1 + ... + Sn).
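To make the scoring rule concrete, here is a minimal TypeScript sketch of the calculation described above; the field names are illustrative (the actual implementation lives in generateEvalScore.ts and may differ):

// Illustrative scoring sketch; field names are hypothetical.
interface TestCase {
  contextLength: 1 | 2 | 3;          // context length difficulty
  reasoningDepth: 1 | 2 | 4;         // reasoning depth difficulty
  instructionCompliance: 1 | 2 | 3;  // instruction compliance difficulty
  assertionResult: number;           // Rn, a decimal in [0, 1]
}

// Dn: sum of the three difficulty values for one test case.
const difficulty = (t: TestCase): number =>
  t.contextLength + t.reasoningDepth + t.instructionCompliance;

// Sn = Rn x Dn for one test case.
const score = (t: TestCase): number => t.assertionResult * difficulty(t);

// Total model score: S1 + ... + Sn over all test cases.
const totalScore = (cases: TestCase[]): number =>
  cases.reduce((sum, t) => sum + score(t), 0);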

Score Table

The following tables show the evaluation results, executed on Jan. 24th, 2024. We ran the evaluation 10 times and took the average scores. The maximum possible total across all 15 test cases is 100.

Score by Abilities

[chart: score by abilities]

Score by Testcases

[chart: score by test cases]

Evaluation Details

See the following links for the evaluation details behind the tables above: Result-1 Result-2 Result-3 Result-4 Result-5 Result-6 Result-7 Result-8 Result-9 Result-10. The models referenced above are:

  1. gpt-4-turbo: openai:gpt-4-1106-preview
  2. gpt-3.5: openai:gpt-3.5-turbo-1106
  3. minimax: minimax:abab6-chat
  4. chatglm: zhipu:glm-4
  5. moonshot: moonshot:moonshot-v1-8k
  6. baichuan2: baichuan:Baichuan2-Turbo
  7. gemini-pro: google:gemini-pro
  8. aliqwen: alibaba:qwen-max
  9. baidu: baidu:ernie_bot_8k
  10. Yi-34b-chat: 01-ai:yi-34b-chat
  11. llama2: meta:llama-2-70b-chat

Quick Start

The testing tools used in this project are provided by promptfoo. To run evaluations, fill in the LLM configurations in promptfooconfig.yaml, and comment out any providers and test cases you don't want to run.
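A minimal, hypothetical sketch of the providers section (verify against the promptfooconfig.yaml shipped with the repo; the webhook URL below is a placeholder):

# Hypothetical sketch; the real promptfooconfig.yaml ships with the repo.
providers:
  - openai:gpt-4-1106-preview
  - openai:gpt-3.5-turbo-1106
  # Custom LLMs can be added as webhook providers, e.g.:
  # - webhook:http://localhost:8080/my-llm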

npm install
npm run start

By default, the test result will be uploaded so that you can share a link to it. If you don't want to share the result:

npm run start:noshare

If you don't have a suitable environment to run the tests, you can use LLM-RGB Online.

If you want to run these tests against LLMs that are not currently listed, you can add custom webhook providers in the same way as the existing ones.

Contribute Test Cases

We welcome contributions of test cases that can evaluate the reasoning and generation abilities of LLMs. Please refer to the existing test cases for the required files and formats.

llm-rgb's People

Contributors

bazinga-wang, fly88oj, vangie, zhlmmc


llm-rgb's Issues

Failures: 271

After running npm run start, it outputs the following error:

Run promptfoo view to use the local web viewer
Run promptfoo share to create a shareable URL
Run promptfoo feedback to share feedback with the developers of this tool
=========================================================================================================================================================================================================
Successes: 14
Failures: 271
Token usage: Total 0, Prompt 0, Completion 0, Cached 0
Done.

> [email protected] render
> ts-node generateEvalScore.ts && ts-node render.ts

Data loaded from JSON file: /home/alex/.promptfoo/output/latest.json
Successfully wrote scores to /home/alex/.promptfoo/output/latest-score.json
Successfully wrote testStats to /home/alex/.promptfoo/output/latest-stats.json
Successfully wrote scores to /home/alex/.promptfoo/output/latest-raw.json
/home/alex/Documents/EvaluateLLM/render.ts:43
scoreData.forEach((item: any) => {
          ^
TypeError: Cannot read properties of undefined (reading 'push')
    at /home/alex/Documents/EvaluateLLM/render.ts:63:33
    at Array.forEach (<anonymous>)
    at /home/alex/Documents/EvaluateLLM/render.ts:62:17
    at Array.forEach (<anonymous>)
    at Object.<anonymous> (/home/alex/Documents/EvaluateLLM/render.ts:43:11)
    at Module._compile (node:internal/modules/cjs/loader:1254:14)
    at Module.m._compile (/home/alex/Documents/EvaluateLLM/node_modules/ts-node/src/index.ts:1618:23)
    at Module._extensions..js (node:internal/modules/cjs/loader:1308:10)
    at Object.require.extensions.<computed> [as .ts] (/home/alex/Documents/EvaluateLLM/node_modules/ts-node/src/index.ts:1621:12)
    at Module.load (node:internal/modules/cjs/loader:1117:32)
(base) alex@fst-computer:~/Documents/EvaluateLLM$ npm run start

> [email protected] start
> npm run eval;npm run render && npm run share


> [email protected] eval
> promptfoo eval --no-cache
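The stack trace points at render.ts pushing into a map entry that was never initialized; note also Token usage: Total 0, which suggests the providers returned no output at all (for example, missing API keys), leaving the score data empty or missing expected keys. A minimal defensive sketch of such a grouping step, with hypothetical names (groupScores and provider are illustrative, not the actual render.ts source):

// Hypothetical guard; names are illustrative, not the actual render.ts code.
function groupScores(scoreData: Array<{ provider?: string }>): Record<string, unknown[]> {
  const byProvider: Record<string, unknown[]> = {};
  scoreData.forEach((item) => {
    const key = item.provider ?? "unknown";
    // Create the bucket before pushing, so an uninitialized key (e.g., when
    // every eval failed and no scores were produced) no longer throws
    // "Cannot read properties of undefined (reading 'push')".
    (byProvider[key] ??= []).push(item);
  });
  return byProvider;
}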

Error in the mahjong test-case prompt

GAME INFO:
Tiles Discarded in Previous Rounds: B6 B7 B8 C7 C9 D2 D2 D5 D5 D5 D8
Observe: Drew D4
Current Tiles: B3B3B3 B9B9B9 C7C7C7 D4D4 D5 D6 D4(just drew)

DECISION:
Thought: The tiles are close to a Bumps-win pattern. The newly drawn D4 can form a bump (D4 D4 D4). I need a pair to complete a Bumps-win; D5 and D6 are the pair candidates. Since the other three D5 tiles have all been discarded already, it is impossible to draw another D5 to form a pair. Thus I should keep D6 and discard D5.
Target Winning Pattern: Bumps-win
Winning Tile(s): D6
Action: Discard D5

In this example, the hand is already a Mixed-win as soon as D4 is drawn, so the example should instead give:

Thought: The tiles form a Mixed-win pattern. The newly drawn D4 completes the straight D4-D5-D6.
Target Winning Pattern: Mixed-win
Winning Tile(s): D4 (just drew)
Action: None

Integrate with LiteLLM - Evaluate 100+ LLMs, 92% faster

Hi @zhlmmc @Bazinga-Wang, I'm the maintainer of LiteLLM. We let you spin up a proxy server that calls 100+ LLMs, making it easier to run benchmarks and evals.

I'm opening this issue because I believe LiteLLM makes it easier for you to run benchmarks and evaluate LLMs (I'd love your feedback if it does not).

Try it here: https://docs.litellm.ai/docs/simple_proxy
https://github.com/BerriAI/litellm

Using LiteLLM Proxy Server

Creating a proxy server

Ollama models

$ litellm --model ollama/llama2 --api_base http://localhost:11434

Hugging Face Models

$ export HUGGINGFACE_API_KEY=my-api-key # [OPTIONAL]
$ litellm --model huggingface/bigcode/starcoder

Anthropic

$ export ANTHROPIC_API_KEY=my-api-key
$ litellm --model claude-instant-1

PaLM

$ export PALM_API_KEY=my-palm-key
$ litellm --model palm/chat-bison

Set api base to proxy

openai.api_base = "http://0.0.0.0:8000"
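For context, here is a minimal TypeScript sketch of calling the proxy through its OpenAI-compatible chat completions endpoint (the URL and model name are illustrative; this is equivalent to pointing openai.api_base at the proxy as above):

// Minimal sketch: call the LiteLLM proxy via its OpenAI-compatible API.
// The URL and model name are illustrative placeholders.
async function askProxy(prompt: string): Promise<string> {
  const res = await fetch("http://0.0.0.0:8000/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "ollama/llama2",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}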

Using the proxy to run an eval on lm-evaluation-harness:

python3 -m lm_eval \
  --model openai-completions \
  --model_args engine=davinci \
  --task crows_pairs_english_age
