
Ada-LEval

The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"

Ada-LEval is a pioneering benchmark for assessing the long-context capabilities of LLMs with length-adaptable questions. It comprises two challenging tasks: TSort, which involves arranging text segments into the correct order, and BestAnswer, which requires choosing the best answer to a question from multiple candidate answers.

Both tasks feature the following advantages:

  1. Controllable Test Cases: The length of each test case can be finely tuned - by adjusting the number and length of text segments in TSort and altering the number of distractor options in BestAnswer.
  2. Necessity for Full-Text Comprehension: Successful completion of both tasks mandates complete reading and understanding of the provided text.
  3. Precise Accuracy Measurement: The design of these tasks allows for unambiguous accuracy calculation: TSort has a single definitive 'correct' order, while in BestAnswer the answer accepted by the questioner serves as the definitive ground truth (a minimal scoring sketch follows this list).
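
Below is a minimal Python sketch of how this unambiguous accuracy can be computed; the function and field names are illustrative assumptions, not the repository's actual API (see run.py for the real implementation).

    from typing import List, Sequence

    def tsort_correct(predicted_order: List[int], gold_order: List[int]) -> bool:
        # TSort: a prediction counts as correct only if the full segment order matches exactly.
        return predicted_order == gold_order

    def bestanswer_correct(predicted_choice: str, gold_choice: str) -> bool:
        # BestAnswer: the answer accepted by the original questioner is the single gold label.
        return predicted_choice == gold_choice

    def accuracy(predictions: Sequence, golds: Sequence, correct_fn) -> float:
        # Plain accuracy over a set of test cases.
        hits = sum(correct_fn(p, g) for p, g in zip(predictions, golds))
        return hits / len(golds)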

🛠️QuickStart

In this repo, we implement the evaluation of Ada-LEval for GPT-4-Turbo-0125 (an example of an API model) and InternLM2-[7B/20B] (an example of open-source LLMs). You can follow our implementation to evaluate Ada-LEval on your own custom LLMs (a minimal adapter sketch is given after the Quickstart steps below).

  1. Preparation

    1. Installation and data preparation

      cd Ada-LEval
      pip install -e . 
      bash fetch_data.sh
    2. For evaluating GPT-4, please set the environment variable: export OPENAI_API_KEY=sk-xxxxx

      • Cost estimation for GPT-4-Turbo-0125: roughly setting length in tokens (2k, 4k, etc.) * n_samples * $0.01 / 1000, i.e. input pricing of $0.01 per 1K tokens. For example, 200 samples at the 16k setting cost about 16,000 * 200 * $0.01 / 1000 ≈ $32.
    3. For evaluating InternLM2-7B, please follow the official guide to install LMDeploy.

  2. Evaluate GPT-4-Turbo-0125: python run.py --data {dataset_name} --model gpt-4-0125

  3. Evaluate InternLM2-7B: bash run.sh --data {dataset_name} --model internlm2-7b

* dataset_name can be stackselect_{setting} (for BestAnswer) or textsort_{setting} (for TSort), e.g. stackselect_16k or textsort_2k. A concrete invocation would be python run.py --data stackselect_16k --model gpt-4-0125.

** run.sh detects the number of available GPUs and runs the evaluation data-parallel across them.
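
To evaluate a custom LLM, the main thing you need is an object that maps an Ada-LEval prompt to a text completion. The sketch below is a hypothetical adapter: the class name, constructor arguments, and generate method are assumptions for illustration, not the interface run.py actually expects, so check run.py and the InternLM2 example before reusing it.

    class MyCustomModel:
        """Hypothetical adapter for a custom LLM; adjust it to whatever run.py expects."""

        def __init__(self, model_path: str, max_new_tokens: int = 128):
            self.model_path = model_path
            self.max_new_tokens = max_new_tokens
            # Load your model weights or API client here.

        def generate(self, prompt: str) -> str:
            # Return the model's raw text response for a single Ada-LEval prompt.
            # Replace this stub with a real inference or API call.
            raise NotImplementedError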

📊 Evaluation Results

Here are the evaluation results for the TSort and BestAnswer benchmarks under long-context and ultra-long-context settings. We also provide a 'random guess' baseline for each task.

Definitions: long-context: context window < 32k; ultra-long-context: context window >= 32k.

Number of evaluation samples: API models, long-context: 200; API models, ultra-long-context: 50; open-source models, long-context: 1000; open-source models, ultra-long-context: 200.

TL;DR:

  1. TSort is an extremely challenging benchmark: we observe positive results (significantly better than random guessing) only when evaluating SOTA API models (the GPT-4 series) under short-context settings (< 8k).
  2. BestAnswer is a challenging long-context benchmark with good discrimination: with a 32k context, GPT-4-Turbo-0125 still obtains a decent 30% accuracy, while other models lag far behind. When the context window is 64k or longer, models fail on almost all of the questions.

TSort Evaluation Results

A '-' indicates that the corresponding setting was not evaluated.

| TSort | 2k | 4k | 8k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|---|---|
| GPT-4-Turbo-0125 | 15.5 | 16.5 | 8.5 | 5.5 | 2.0 | 4.0 | 2.0 |
| GPT-4-Turbo-1106 | 18.5 | 15.5 | 7.5 | 3.5 | 6.0 | 6.0 | 6.0 |
| GPT-3.5-Turbo-1106 | 4.0 | 4.5 | 4.5 | 5.5 | - | - | - |
| Claude-2 | 5.0 | 5.0 | 4.5 | 3.0 | 0.0 | 0.0 | - |
| LongChat-7b-v1.5-32k | 5.3 | 5.0 | 3.1 | 2.5 | - | - | - |
| ChatGLM2-6B-32k | 0.9 | 0.7 | 0.2 | 0.9 | - | - | - |
| ChatGLM3-6B-32k | 2.3 | 2.4 | 2.0 | 0.7 | - | - | - |
| Vicuna-7b-v1.5-16k | 5.3 | 2.2 | 2.3 | 1.7 | - | - | - |
| Vicuna-13b-v1.5-16k | 5.4 | 5.0 | 2.4 | 3.1 | - | - | - |
| InternLM2-7b | 5.1 | 3.9 | 5.1 | 4.3 | - | - | - |
| Random Guess | 4.2 | 4.2 | 4.2 | 4.2 | 4.2 | 4.2 | 4.2 |
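
As a sanity check on the baseline row above: ordering n shuffled segments uniformly at random succeeds with probability 1/n!, and the constant 4.2% matches 1/4! ≈ 4.17%, i.e. four segments per test case (the segment count is inferred from the baseline here, not stated in this README).

    import math

    # Random-guess success probability (in %) for recovering the order of n shuffled segments.
    n_segments = 4  # inferred from the 4.2% baseline; see the paper for the exact task setup
    print(100 / math.factorial(n_segments))  # ~4.17, matching the Random Guess row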

BestAnswer Evaluation Results

A '-' indicates that the corresponding setting was not evaluated.

| BestAnswer | 1k | 2k | 4k | 6k | 8k | 12k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4-Turbo-0125 | 73.5 | 73.5 | 65.5 | 63.0 | 56.5 | 52.0 | 44.5 | 30.0 | 0.0 | 0.0 |
| GPT-4-Turbo-1106 | 74.0 | 73.5 | 67.5 | 59.5 | 53.5 | 49.5 | 44.0 | 16.0 | 0.0 | 0.0 |
| GPT-3.5-Turbo-1106 | 61.5 | 48.5 | 41.5 | 29.5 | 17.0 | 2.5 | 2.5 | - | - | - |
| Claude-2 | 65.0 | 43.5 | 23.5 | 15.0 | 17.0 | 12.0 | 11.0 | 4.0 | 0.0 | - |
| LongChat-7b-v1.5-32k | 32.4 | 10.7 | 5.7 | 3.1 | 1.9 | 1.6 | 0.8 | - | - | - |
| ChatGLM2-6B-32k | 31.2 | 10.9 | 4.5 | 1.6 | 1.6 | 0.0 | 0.3 | - | - | - |
| ChatGLM3-6B-32k | 39.8 | 18.8 | 9.0 | 5.0 | 3.4 | 0.9 | 0.5 | - | - | - |
| Vicuna-7b-v1.5-16k | 37.0 | 11.1 | 5.8 | 3.2 | 1.8 | 1.9 | 1.0 | - | - | - |
| Vicuna-13b-v1.5-16k | 53.4 | 29.2 | 13.1 | 4.3 | 2.2 | 1.4 | 0.9 | - | - | - |
| InternLM2-7b | 58.6 | 49.5 | 33.9 | 12.3 | 13.4 | 2.0 | 0.8 | 0.5 | 0.5 | 0.0 |
| Random Guess | 26.7 | 10.1 | 4.5 | 3.0 | 2.3 | 1.4 | 1.1 | 0.6 | 0.3 | 0.1 |

🖊️Citation

@misc{wang2024adaleval,
      title={Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks}, 
      author={Chonghua Wang and Haodong Duan and Songyang Zhang and Dahua Lin and Kai Chen},
      year={2024},
      eprint={2404.06480},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

ada-leval's Issues

The scoring logic of the BestAnswer evaluation is flawed, leading to inflated scores

I used the BestAnswer task to evaluate most of the open-source models and saw clear discrimination between open-source and closed-source models (which also matched the prior rankings I had obtained on other benchmarks), so I considered the benchmark design to be excellent. However, after glm-4 was open-sourced, I continued using the BestAnswer task to test its long-context ability and found the evaluation results to be abnormally good, second only to GPT-4o among all models, which made me realize there was a problem in the evaluation step. After inspecting the scoring code, I found an obvious flaw in its logic.
The original scoring code is as follows:

def extract(line):
    nc = line['num_choice']
    cands = [f'A{i}' for i in range(1, nc + 1)]
    finds = [line['prediction'].find(c) for c in cands]
    matched = sum([x >= 0 for x in finds])
    if matched >= 1:
        for i in range(nc - 1, -1, -1):
            if finds[i] >= 0:
                return cands[i]
    else:
        cands = [str(i) for i in range(1, nc + 1)]
        finds = [line['prediction'].find(c) for c in cands]
        matched = sum([x >= 0 for x in finds])
        if matched >= 1:
            for i in range(nc - 1, -1, -1):
                if finds[i] >= 0:
                    return 'A' + cands[i]
        else:
            return '???'

Clearly, this logic returns the highest-indexed option in the candidate list that appears anywhere in the model's response (I guess this is meant to avoid matching A4 before A41, and so on). Under this logic, however, every model's score is inflated to some degree. I changed it to the following logic (I believe a model that truly follows the instruction should state the correct answer first), and all evaluation scores dropped by a noticeable margin (a concrete example of the difference follows the code below).

    def extract(line):
        nc = line['num_choice']
        cands = [f'A{i}' for i in range(1, nc + 1)]
        finds = []
        for c in cands:
            pos = line['response'].find(c)
            if pos != -1:
                # Check that the match is followed by a non-digit character or the end of
                # the string, so that e.g. 'A4' does not match inside 'A41'.
                end_pos = pos + len(c)
                if end_pos == len(line['response']) or not line['response'][end_pos].isdigit():
                    finds.append((c, pos))

        if finds:
            # Pick the match that appears earliest in the response.
            first_match = min(finds, key=lambda x: x[1])
            return first_match[0]
        else:
            return '???'
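
For instance, with a hypothetical response that mentions two options, the two extractors disagree:

    # Hypothetical response (both keys are included so that either version of extract runs on it).
    line = {
        'num_choice': 20,
        'prediction': 'A3 is the best answer, although A12 is also relevant.',
        'response': 'A3 is the best answer, although A12 is also relevant.',
    }
    # Original extract(line) -> 'A12'  (the highest-indexed option found anywhere in the text)
    # Modified extract(line) -> 'A3'   (the earliest option, with a digit-boundary check)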

Will this benchmark be added to the opencompass evaluation framework?

I have tried many benchmarks, including ones I implemented myself, and I think BestAnswer is currently the most reliable evaluation of the long-context capability of LLMs, especially for code-related capabilities.
As a temporary solution, I have rewritten the inference and evaluation code based on the current implementation to support evaluating more models directly on top of vLLM.
I would like to know whether this benchmark will be added to OpenCompass in the near future. I think that would make it more convenient for scientific research.
