
llm-rgb's Introduction

LLM Reasoning and Generation Benchmark

This repository contains a collection of detailed test cases (prompts) designed to evaluate the reasoning and generation capabilities of Large Language Models (LLMs) in complex scenarios. Note that this benchmark is not intended to be a comprehensive test of LLMs. The project began as an internal effort at babel.cloud to assess how well LLMs understand context and comply with instructions.

Complex scenarios present three main challenges compared to chat or simple generation tasks:

  1. Context Length: A single prompt may contain more than 8000 tokens (approximately 20K characters).
  2. Reasoning Depth: The generation of an answer may require multi-step reasoning.
  3. Instruction Compliance: The LLM may need to generate a response in a specific format, rather than in natural language.

Each test case is a generation task for an LLM, without involving multi-turn conversations. The complexity of each test case is assessed based on the following dimensions:

Context Length Difficulty: 1 - 3

The value is 1 if the prompt contains 2,000 characters or fewer, 2 if it contains between 2,001 and 5,000 characters (inclusive), and 3 if it contains more than 5,000. A model's actual performance in this dimension depends on the result of each task together with that task's context length difficulty; it is not accurate to rate a model's ability across context lengths solely by the maximum context length it can handle.

Reasoning Depth Difficulty: 1 - 4

The value is 1 if the answer can be inferred directly from the context, as in a knowledge-base lookup. The value is 2 if the answer requires reasoning, for example: "Who is considered the father of the iPhone, and what is the last digit of his birth year?". The value is 4 if the answer requires reasoning over the provided context, such as writing a program in a syntax defined only in that context.

Instruction Compliance Difficulty: 1 - 3

The value is 1 if the expected response is natural language with no special requirements, 2 if it must follow a specific style (e.g., "YES or NO" only, or a shell command only), and 3 if it must follow a structured format such as JSON or YAML.

The difficulty of each test case (Dn) is the sum of its three difficulty values. Each test case includes a set of assertions that evaluate the LLM's output; the assertion result (Rn) is a decimal in [0, 1]. The final score of test case n is Sn = Rn x Dn. The total score for each LLM is the sum of all test case scores (S1 + ... + Sn).
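To make the scoring rule concrete, here is a minimal TypeScript sketch of the calculation described above; the field names are illustrative (the actual implementation lives in generateEvalScore.ts and may differ):

// Illustrative scoring sketch; field names are hypothetical.
interface TestCase {
  contextLength: 1 | 2 | 3;          // context length difficulty
  reasoningDepth: 1 | 2 | 4;         // reasoning depth difficulty
  instructionCompliance: 1 | 2 | 3;  // instruction compliance difficulty
  assertionResult: number;           // Rn, a decimal in [0, 1]
}

// Dn: sum of the three difficulty values for one test case.
const difficulty = (t: TestCase): number =>
  t.contextLength + t.reasoningDepth + t.instructionCompliance;

// Sn = Rn x Dn for one test case.
const score = (t: TestCase): number => t.assertionResult * difficulty(t);

// Total model score: S1 + ... + Sn over all test cases.
const totalScore = (cases: TestCase[]): number =>
  cases.reduce((sum, t) => sum + score(t), 0);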

Score Table

The following tables show the evaluation results, executed on Jan. 24th, 2024. We ran the evaluation 10 times and took the average scores. The maximum possible total across all 15 test cases is 100.

Score by Abilities

[chart: score by abilities]

Score by Testcases

[chart: score by test cases]

Evaluation Details

See the following links for the evaluation details behind the tables above: Result-1 Result-2 Result-3 Result-4 Result-5 Result-6 Result-7 Result-8 Result-9 Result-10. The models referenced above are:

  1. gpt-4-turbo: openai:gpt-4-1106-preview
  2. gpt-3.5: openai:gpt-3.5-turbo-1106
  3. minimax: minimax:abab6-chat
  4. chatglm: zhipu:glm-4
  5. moonshot: moonshot:moonshot-v1-8k
  6. baichuan2: baichuan:Baichuan2-Turbo
  7. gemini-pro: google:gemini-pro
  8. aliqwen: alibaba:qwen-max
  9. baidu: baidu:ernie_bot_8k
  10. Yi-34b-chat: 01-ai:yi-34b-chat
  11. llama2: meta:llama-2-70b-chat

Quick Start

The testing tools used in this project are provided by promptfoo. To run evaluations, fill in the LLM configurations in promptfooconfig.yaml, and comment out any providers and test cases you don't want to run.
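A minimal, hypothetical sketch of the providers section (verify against the promptfooconfig.yaml shipped with the repo; the webhook URL below is a placeholder):

# Hypothetical sketch; the real promptfooconfig.yaml ships with the repo.
providers:
  - openai:gpt-4-1106-preview
  - openai:gpt-3.5-turbo-1106
  # Custom LLMs can be added as webhook providers, e.g.:
  # - webhook:http://localhost:8080/my-llm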

npm install
npm run start

By default, the test result will be uploaded so that you can share a link to it. If you don't want to share the result:

npm run start:noshare

If you don't have a suitable environment to run the tests, you can use LLM-RGB Online.

If you want to run these tests against LLMs that are not currently listed, you can add custom webhook providers in the same way as the existing ones.

Contribute Test Cases

We welcome contributions of test cases that can evaluate the reasoning and generation abilities of LLMs. Please refer to the existing test cases for the required files and formats.

llm-rgb's People

Contributors

bazinga-wang, fly88oj, vangie, zhlmmc


llm-rgb's Issues

Failures: 271

After running npm run start, it outputs the following error:

Run promptfoo view to use the local web viewer
Run promptfoo share to create a shareable URL
Run promptfoo feedback to share feedback with the developers of this tool
=========================================================================================================================================================================================================
Successes: 14
Failures: 271
Token usage: Total 0, Prompt 0, Completion 0, Cached 0
Done.

> [email protected] render
> ts-node generateEvalScore.ts && ts-node render.ts

Data loaded from JSON file: /home/alex/.promptfoo/output/latest.json
Successfully wrote scores to /home/alex/.promptfoo/output/latest-score.json
Successfully wrote testStats to /home/alex/.promptfoo/output/latest-stats.json
Successfully wrote scores to /home/alex/.promptfoo/output/latest-raw.json
/home/alex/Documents/EvaluateLLM/render.ts:43
scoreData.forEach((item: any) => {
          ^
TypeError: Cannot read properties of undefined (reading 'push')
    at /home/alex/Documents/EvaluateLLM/render.ts:63:33
    at Array.forEach (<anonymous>)
    at /home/alex/Documents/EvaluateLLM/render.ts:62:17
    at Array.forEach (<anonymous>)
    at Object.<anonymous> (/home/alex/Documents/EvaluateLLM/render.ts:43:11)
    at Module._compile (node:internal/modules/cjs/loader:1254:14)
    at Module.m._compile (/home/alex/Documents/EvaluateLLM/node_modules/ts-node/src/index.ts:1618:23)
    at Module._extensions..js (node:internal/modules/cjs/loader:1308:10)
    at Object.require.extensions.<computed> [as .ts] (/home/alex/Documents/EvaluateLLM/node_modules/ts-node/src/index.ts:1621:12)
    at Module.load (node:internal/modules/cjs/loader:1117:32)
(base) alex@fst-computer:~/Documents/EvaluateLLM$ npm run start

> [email protected] start
> npm run eval;npm run render && npm run share


> [email protected] eval
> promptfoo eval --no-cache
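The stack trace points at render.ts pushing into a map entry that was never initialized; note also Token usage: Total 0, which suggests the providers returned no output at all (for example, missing API keys), leaving the score data empty or missing expected keys. A minimal defensive sketch of such a grouping step, with hypothetical names (groupScores and provider are illustrative, not the actual render.ts source):

// Hypothetical guard; names are illustrative, not the actual render.ts code.
function groupScores(scoreData: Array<{ provider?: string }>): Record<string, unknown[]> {
  const byProvider: Record<string, unknown[]> = {};
  scoreData.forEach((item) => {
    const key = item.provider ?? "unknown";
    // Create the bucket before pushing, so an uninitialized key (e.g., when
    // every eval failed and no scores were produced) no longer throws
    // "Cannot read properties of undefined (reading 'push')".
    (byProvider[key] ??= []).push(item);
  });
  return byProvider;
}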

Error in the mahjong test-case prompt

GAME INFO:
Tiles Discarded in Previous Rounds: B6 B7 B8 C7 C9 D2 D2 D5 D5 D5 D8
Observe: Drew D4
Current Tiles: B3B3B3 B9B9B9 C7C7C7 D4D4 D5 D6 D4(just drew)

DECISION:
Thought: The tiles are close to a Bumps-win pattern. The newly drawn D4 can form a bump (D4 D4 D4). I need a pair to complete a Bumps-win; D5 and D6 are the pair candidates. Since the other three D5 tiles have all been discarded already, it is impossible to draw another D5 to form a pair. Thus I should keep D6 and discard D5.
Target Winning Pattern: Bumps-win
Winning Tile(s): D6
Action: Discard D5

In this example, the hand is already a Mixed-win as soon as D4 is drawn, so the example should instead give:

Thought: The tiles form a Mixed-win pattern. The newly drawn D4 completes the straight D4-D5-D6.
Target Winning Pattern: Mixed-win
Winning Tile(s): D4 (just drew)
Action: None

Integrate with LiteLLM - Evaluate 100+ LLMs, 92% faster

Hi @zhlmmc @Bazinga-Wang, I'm the maintainer of LiteLLM. We let you spin up a proxy server that calls 100+ LLMs, making it easier to run benchmarks and evals.

I'm opening this issue because I believe LiteLLM makes it easier for you to run benchmarks and evaluate LLMs (I'd love your feedback if it does not).

Try it here: https://docs.litellm.ai/docs/simple_proxy
https://github.com/BerriAI/litellm

Using LiteLLM Proxy Server

Creating a proxy server

Ollama models

$ litellm --model ollama/llama2 --api_base http://localhost:11434

Hugging Face Models

$ export HUGGINGFACE_API_KEY=my-api-key # [OPTIONAL]
$ litellm --model huggingface/bigcode/starcoder

Anthropic

$ export ANTHROPIC_API_KEY=my-api-key
$ litellm --model claude-instant-1

PaLM

$ export PALM_API_KEY=my-palm-key
$ litellm --model palm/chat-bison

Set api base to proxy

openai.api_base = "http://0.0.0.0:8000"
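For context, here is a minimal TypeScript sketch of calling the proxy through its OpenAI-compatible chat completions endpoint (the URL and model name are illustrative; this is equivalent to pointing openai.api_base at the proxy as above):

// Minimal sketch: call the LiteLLM proxy via its OpenAI-compatible API.
// The URL and model name are illustrative placeholders.
async function askProxy(prompt: string): Promise<string> {
  const res = await fetch("http://0.0.0.0:8000/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "ollama/llama2",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}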

Using the proxy to run an eval on lm-evaluation-harness:

python3 -m lm_eval \
  --model openai-completions \
  --model_args engine=davinci \
  --task crows_pairs_english_age
