
promptbench's Introduction




PromptBench: A Unified Library for Evaluating and Understanding Large Language Models.
Paper · Documentation · Leaderboard · More papers

Table of Contents
  1. News and Updates
  2. Introduction
  3. Installation
  4. Usage
  5. Datasets and Models
  6. Benchmark Results
  7. Acknowledgments

News and Updates

  • [13/03/2024] Add support for multi-modal models and datasets.
  • [05/01/2024] Add support for BigBench Hard, DROP, ARC datasets.
  • [16/12/2023] Add support for Gemini, Mistral, Mixtral, Baichuan, Yi models.
  • [15/12/2023] Add detailed instructions for users to add new modules (models, datasets, etc.) examples/add_new_modules.md.
  • [05/12/2023] Published promptbench 0.0.1.

Introduction

PromptBench is a PyTorch-based Python package for evaluating Large Language Models (LLMs). It provides user-friendly APIs for researchers to evaluate LLMs. See the technical report: https://arxiv.org/abs/2312.07910.

Code Structure

What does promptbench currently provide?

  1. Quick model performance assessment: We offer a user-friendly interface that allows for quick model building, dataset loading, and evaluation of model performance.
  2. Prompt engineering: We implement several prompt engineering methods, such as Few-shot Chain-of-Thought [1], EmotionPrompt [2], and ExpertPrompting [3].
  3. Evaluating adversarial prompts: PromptBench integrates prompt attacks [4], enabling researchers to simulate black-box adversarial prompt attacks on models and evaluate their robustness (see details here).
  4. Dynamic evaluation to mitigate potential test data contamination: We integrate the dynamic evaluation framework DyVal [5], which generates evaluation samples on the fly with controlled complexity.

Installation

Install via pip

We provide a Python package, promptbench, for users who want to start evaluating quickly. Simply run:

pip install promptbench

Note that the pip installation may lag behind the most recent updates, so if you want to use the latest features or develop based on our code, you should install via GitHub.

Install via GitHub

First, clone the repo:

git clone git@github.com:microsoft/promptbench.git

Then,

cd promptbench

To install the required packages, you can create a conda environment:

conda create --name promptbench python=3.9

Then activate the environment and use pip to install the required packages:

pip install -r requirements.txt

Note that this installs only the basic Python packages. For prompt attacks, you will also need to install TextAttack.

Usage

promptbench is easy to use and extend. The examples below will help you get familiar with promptbench, whether you want to run a quick evaluation, evaluate existing datasets and LLMs, or create your own datasets and models.

Please see Installation to install promptbench first.

If promptbench is installed via pip, you can simply do:

import promptbench as pb

If you installed promptbench from GitHub and want to use it in other projects:

import sys

# Add the directory of promptbench to the Python path
sys.path.append('/home/xxx/promptbench')

# Now you can import promptbench by name
import promptbench as pb

We provide tutorials for:

  1. Evaluate models on existing benchmarks: please refer to examples/basic.ipynb for constructing your evaluation pipeline. For a multi-modal evaluation pipeline, please refer to examples/multimodal.ipynb.
  2. Test the effects of different prompting techniques.
  3. Examine robustness against prompt attacks: please refer to examples/prompt_attack.ipynb to construct the attacks.
  4. Use DyVal for evaluation: please refer to examples/dyval.ipynb to construct DyVal datasets.
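
A minimal end-to-end sketch is shown below. The names pb.DatasetLoader.load_dataset and pb.LLMModel, the dataset and model names, and the "content"/"label" sample fields all appear elsewhere in this document; the prompt text, the temperature value, and the assumption that the model object is callable on a prompt string are illustrative and may differ between versions.

import promptbench as pb

# Load a built-in dataset and build a model (argument names follow the LLMModel
# docstring quoted later in this document; the temperature value is an assumption).
dataset = pb.DatasetLoader.load_dataset("sst2")
model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=20, temperature=0.0001)

# Query the model on a few samples; sst2 samples expose "content" and "label" fields.
prompt = ("As a sentiment classifier, determine whether the following text is "
          "'positive' or 'negative'.\nQuestion: {content}\nAnswer: ")
for i, sample in enumerate(dataset):
    if i >= 3:
        break
    pred = model(prompt.format(content=sample["content"]))  # assumes LLMModel instances are callable
    print(sample["label"], pred)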

Implemented Components

PromptBench currently supports a range of datasets, models, prompt engineering methods, and adversarial attacks. You are welcome to add more.

Datasets

  • Language datasets:
    • GLUE: SST-2, CoLA, QQP, MRPC, MNLI, QNLI, RTE, WNLI
    • MMLU
    • BIG-Bench Hard (Bool logic, valid parentheses, date...)
    • Math
    • GSM8K
    • SQuAD V2
    • IWSLT 2017
    • UN Multi
    • CSQA (CommonSense QA)
    • Numersense
    • QASC
    • Last Letter Concatenate
  • Multi-modal datasets:
    • VQAv2
    • NoCaps
    • MMMU
    • MathVista
    • AI2D
    • ChartQA
    • ScienceQA

Models

Language models:

  • Open-source models:
    • google/flan-t5-large
    • databricks/dolly-v1-6b
    • Llama2 series
    • vicuna-13b, vicuna-13b-v1.3
    • Cerebras/Cerebras-GPT-13B
    • EleutherAI/gpt-neox-20b
    • Google/flan-ul2
    • phi-1.5 and phi-2
  • Proprietary models
    • PaLM 2
    • GPT-3.5
    • GPT-4
    • Gemini Pro

Multi-modal models:

  • Open-source models:
    • BLIP2
    • LLaVA
    • Qwen-VL, Qwen-VL-Chat
    • InternLM-XComposer2-VL
  • Proprietary models
    • GPT-4v
    • Gemini Pro Vision
    • Qwen-VL-Max, Qwen-VL-Plus

Prompt Engineering

  • Chain-of-thought (COT) [1]
  • EmotionPrompt [2]
  • Expert prompting [3]
  • Zero-shot chain-of-thought
  • Generated knowledge [6]
  • Least to most [7]

Adversarial Attacks

  • Character-level attack
    • DeepWordBug
    • TextBugger
  • Word-level attack
    • TextFooler
    • BertAttack
  • Sentence-level attack
    • CheckList
    • StressTest
  • Semantic-level attack
    • Human-crafted attack

Protocols and Analysis

  • Standard evaluation
  • Dynamic evaluation
  • Semantic evaluation
  • Benchmark results
  • Visualization analysis
  • Transferability analysis
  • Word frequency analysis

Benchmark Results

Please refer to our benchmark website for benchmark results on Prompt Attacks, Prompt Engineering, and Dynamic Evaluation (DyVal).

Acknowledgements

  • TextAttack
  • README Template
  • We thank the volunteers Hanyuan Zhang, Lingrui Li, and Yating Zhou for conducting the semantic-preserving experiment in the Prompt Attack benchmark.

Reference

[1] Jason Wei, et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv preprint arXiv:2201.11903 (2022).

[2] Cheng Li, et al. "EmotionPrompt: Leveraging Psychology for Large Language Models Enhancement via Emotional Stimulus." arXiv preprint arXiv:2307.11760 (2023).

[3] Benfeng Xu, et al. "ExpertPrompting: Instructing Large Language Models to be Distinguished Experts." arXiv preprint arXiv:2305.14688 (2023).

[4] Kaijie Zhu, et al. "PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts." arXiv preprint arXiv:2306.04528 (2023).

[5] Kaijie Zhu, et al. "DyVal: Graph-informed Dynamic Evaluation of Large Language Models." arXiv preprint arXiv:2309.17167 (2023).

[6] Jiacheng Liu, et al. "Generated Knowledge Prompting for Commonsense Reasoning." arXiv preprint arXiv:2110.08387 (2021).

[7] Denny Zhou, et al. "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models." arXiv preprint arXiv:2205.10625 (2022).

Citing promptbench and other research papers

Please cite us if you find this project helpful for your project/paper:

@article{zhu2023promptbench2,
  title={PromptBench: A Unified Library for Evaluation of Large Language Models},
  author={Zhu, Kaijie and Zhao, Qinlin and Chen, Hao and Wang, Jindong and Xie, Xing},
  journal={arXiv preprint arXiv:2312.07910},
  year={2023}
}

@article{zhu2023promptbench,
  title={PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts},
  author={Zhu, Kaijie and Wang, Jindong and Zhou, Jiaheng and Wang, Zichen and Chen, Hao and Wang, Yidong and Yang, Linyi and Ye, Wei and Gong, Neil Zhenqiang and Zhang, Yue and others},
  journal={arXiv preprint arXiv:2306.04528},
  year={2023}
}

@article{zhu2023dyval,
  title={DyVal: Graph-informed Dynamic Evaluation of Large Language Models},
  author={Zhu, Kaijie and Chen, Jiaao and Wang, Jindong and Gong, Neil Zhenqiang and Yang, Diyi and Xie, Xing},
  journal={arXiv preprint arXiv:2309.17167},
  year={2023}
}

@article{chang2023survey,
  title={A survey on evaluation of large language models},
  author={Chang, Yupeng and Wang, Xu and Wang, Jindong and Wu, Yuan and Zhu, Kaijie and Chen, Hao and Yang, Linyi and Yi, Xiaoyuan and Wang, Cunxiang and Wang, Yidong and others},
  journal={arXiv preprint arXiv:2307.03109},
  year={2023}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

If you have a suggestion that would make promptbench better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the project
  2. Create your branch (git checkout -b your_name/your_branch)
  3. Commit your changes (git commit -m 'Add some features')
  4. Push to the branch (git push origin your_name/your_branch)
  5. Open a Pull Request

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.


promptbench's Issues

MMLU subset

There are 800+ samples in the MMLU subset (data/mmlu.json). How does this work?

How does it align with the 564 samples introduced in the paper?

Potentially a bug in `LabelConstraint`

Hello,
Firstly, thanks for your contribution and for open-sourcing your code!

I got the following error trying to reproduce an attack for the SST2 dataset:

  File "<PATH>/promptbench/prompt_attack/label_constraint.py", line 12, in <listcomp>
    self.labels = [label.lower() for label in labels]
AttributeError: 'int' object has no attribute 'lower'

Which makes sense, since in the config the labels include integers:

    'sst2': ['positive', 'negative', 'positive\'', 'negative\'', '0', '1', '0\'', '1\'',0, 1], 

But in the LabelConstraint constructor we have:

    def __init__(self, labels=[]):
        self.labels = [label.lower() for label in labels]

The changes were made in this PR.

To make it run, we can simply delete the integer labels 0 and 1, though I'm afraid the ground-truth labels might be integers.
So another solution could be to apply .lower() only to string labels, which sounds like the correct approach to me; see the sketch below.
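
A minimal sketch of that second option (a suggestion, not the library's current code; only the constructor is shown, and anything else in the class is assumed unchanged):

class LabelConstraint:
    def __init__(self, labels=None):
        labels = labels or []
        # Lowercase string labels only; integer labels such as 0 and 1 pass through unchanged.
        self.labels = [label.lower() if isinstance(label, str) else label for label in labels]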

What do you guys think?

About black-box attacks

Hey, I want to know the details of the GPT attack.

Is your attack like this: for a sample whose original prediction is correct, if the output becomes wrong after you add an attack prompt, does that mean the attack is successful?

In a score-based attack, the output probabilities decrease gradually, like 1.0 -> 0.8 -> 0.5 -> 0.4.
In your method, is it more like 1.0 -> 0?
Thanks

Some datasets are not available

I'm not sure if I'm missing something, but I tried loading MMLU and SQuAD v2, and it always gives me an error (the JSON file is empty).

It seems to fail to fetch from the URL provided here, since the data folder does not exist:

        # check if the dataset exists, if not, download it
        self.filepath = os.path.join(self.data_dir, f"{dataset_name}.json")
        self.filepath2 = os.path.join(self.data_dir, f"{dataset_name}.jsonl")
        if not os.path.exists(self.filepath):
            if os.path.exists(self.filepath2):
                self.filepath = self.filepath2
            else:
                url = f'https://wjdcloud.blob.core.windows.net/dataset/promptbench/dataset/{dataset_name}.json'
                print(f"Downloading {dataset_name} dataset...")
                response = requests.get(url)
                with open(self.filepath, 'wb') as f:
                    f.write(response.content)
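
For reference, a hedged sketch of a possible workaround (not the library's code): create the data directory before downloading, and surface HTTP errors instead of silently writing an empty file. The URL is the one quoted above; the function name is hypothetical.

import os
import requests

def download_dataset(data_dir, dataset_name):
    # Ensure the data folder exists before writing into it.
    os.makedirs(data_dir, exist_ok=True)
    filepath = os.path.join(data_dir, f"{dataset_name}.json")
    if not os.path.exists(filepath):
        url = f"https://wjdcloud.blob.core.windows.net/dataset/promptbench/dataset/{dataset_name}.json"
        print(f"Downloading {dataset_name} dataset...")
        response = requests.get(url)
        response.raise_for_status()  # fail loudly instead of writing an empty or invalid file
        with open(filepath, "wb") as f:
            f.write(response.content)
    return filepath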

About Checklist attack method

Hi,

I was wondering why your implementation of the CheckList attack method differs from the one in the TextAttack package. Thank you for your time!

Running time

Hello, I was running the command below. It has not finished after a couple of hours. Do you have an estimate of how long it takes?

python main.py --model google/flan-t5-large --dataset mnli --attack textfooler --shot 0 --generate_len 20

Parsing the output of a model

The output of the model depends heavily on the prompt. In text classification tasks such as SST-2, which require the model to answer 'positive' or 'negative', the model may answer in various forms, and the answer format can differ from model to model. Is there a good way to parse these results?
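
One common approach, sketched here as a generic suggestion rather than a promptbench feature, is to normalize the raw generation and map it onto the label set with simple string matching:

import re

def parse_sentiment(output, labels=("positive", "negative")):
    # Lowercase and strip punctuation so "Answer: Positive." and "positive" match alike.
    text = re.sub(r"[^a-z\s]", " ", output.lower())
    # Return the first known label mentioned in the output, else a sentinel value.
    for word in text.split():
        if word in labels:
            return word
    return "unknown"

print(parse_sentiment("Answer: Positive."))  # -> "positive"
print(parse_sentiment("It is negative"))     # -> "negative"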

Interpretation of attack results

Thanks for this amazing repo for evaluating different prompts; I am keen to apply it to more LLMs 🤗

I tried to evaluate the prompt attack following your example, and obtained the following output:
(0.0, "Aѕ a sentiment classifier, determine whether the following text is 'positive' or 'negative'. Please classify: \nQuestion: {content}\nAnswer: ", 0.0, 0.0)

I would like to know how to interpret these numbers and, more importantly, how I can obtain evaluation results like those in promptbench/prompt_attack/README.md, since I haven't found related information in your documents. Am I missing something?

Thank you very much!

About Runtime

Hi! Thanks for your awesome work! When I run the command python main.py --model llama-13b --dataset mnli --attack bertattack --shot 0 --generate_len 20 on one NVIDIA 3090 (24 GB), it has already taken 20 hours and I don't know when it will finish. The logs look like this:

2023-07-26 08:31:41,068 - INFO - gt: 2
2023-07-26 08:31:41,068 - INFO - Pred: -1
2023-07-26 08:31:41,068 - INFO - sentence: Assess the connection between the following sentences and count it as 'entailment', 'neutral', or 'contradiction':Premise: He hadn't seen even pictures of such things since the few silent movies run in some of the little art theaters. Hypothesis: He had recently seen pictures depicting those things. Answer: 

but the results folder only contains a file named mnli, which is also empty. Could you please give a rough estimate of the running time on your machines? Thanks a lot!

Enhancement Proposal for Dynamic Evaluation Framework Integration

Dear PromptBench Contributors,

I trust this message finds you in good health and high spirits. I am writing to propose an enhancement to the PromptBench library that I believe could significantly augment its utility for researchers and practitioners alike.

Having perused the documentation and utilised the library extensively, I have found the dynamic evaluation framework, DyVal, to be an invaluable asset for assessing the performance of Large Language Models (LLMs) in a more robust and realistic manner. However, I have observed that the integration of DyVal could be further refined to provide a more seamless user experience and to extend its capabilities.

To wit, I propose the following enhancements:

  1. Streamlined Integration: Simplifying the process of setting up and running DyVal within the PromptBench environment. This could involve automating the setup process and providing a more intuitive API that abstracts away some of the complexities involved in configuring the dynamic evaluation parameters.

  2. Extended Dataset Support: Expanding the range of datasets compatible with DyVal. While the current selection is commendable, incorporating additional datasets, particularly from niche domains, could broaden the scope of dynamic evaluation and provide deeper insights into model performance across diverse contexts.

  3. Custom Evaluation Metrics: Facilitating the integration of custom evaluation metrics within the DyVal framework. This would allow researchers to define and utilise metrics that are specifically tailored to their unique research questions or application requirements.

  4. Real-time Performance Monitoring: Introducing functionality for real-time monitoring of model performance during dynamic evaluation. This could include visual dashboards or logging mechanisms that provide immediate feedback on the model's performance trends over time.

  5. Guidance on Best Practices: Providing comprehensive documentation and examples that elucidate best practices for leveraging DyVal in various research scenarios. This could greatly assist users in maximising the potential of dynamic evaluation for their specific use cases.

I am keen to engage in further dialogue regarding these suggestions and explore how we might collaborate to bring these enhancements to fruition. I am convinced that by enriching the DyVal integration, PromptBench can offer even greater value to the community by enabling more nuanced and insightful evaluation of LLMs.

Thank you for considering my proposal. I eagerly anticipate your response and am hopeful for the opportunity to contribute to the evolution of PromptBench.

Best regards,
yihong1120

Numerous datasets exhibit a label of -1

Hello,

Attempting to load multiple datasets, I observed that several of them have all their elements labeled as -1.

For example:

import promptbench as pb
from tqdm import tqdm
dataset = pb.DatasetLoader.load_dataset("sst2")

print([e for e in tqdm(dataset) if e["label"] != -1])

Results: []
I'm using the promptbench version 0.0.2.

Thanks!

LLaMa 2 inference

I tried to benchmark the LLaMA 2 chat model from Hugging Face and got a ValueError:
ValueError: temperature(=0) has to be a strictly positive float, otherwise your next token scores will be invalid.
It is caused by this line.
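
A possible workaround, sketched under the assumption that near-greedy decoding is acceptable for your experiments (this is not an official fix, and the model identifier is illustrative): pass a very small positive float instead of 0.

import promptbench as pb

# Use a tiny positive temperature so generation stays effectively greedy while
# satisfying the "strictly positive float" check behind the ValueError above.
model = pb.LLMModel(model="llama2-7b-chat", max_new_tokens=20, temperature=1e-4)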

Access to per-sample evaluation results

Hi,
Thanks for the great work! For my current project, I would like to use the sample-wise evaluation results of the VLMs from the experiments you have conducted.

If you could provide the sample-wise evaluation logs on the multimodal datasets mentioned (VQAv2, NoCaps, MMMU, MathVista, AI2D, ChartQA, ScienceQA) for the models evaluated (BLIP2, LLaVA, Qwen-VL, Qwen-VL-Chat, InternLM-XComposer2-VL, GPT-4V, Gemini Pro Vision, Qwen-VL-Max, Qwen-VL-Plus), I would greatly appreciate it. If I missed a dataset or model, please feel free to include them.

The ‘temperature’ parameter in pb.LLMModel needs to be of type float


"""
A class providing an interface for various language models.

This class supports creating and interfacing with different language models, handling prompt engineering, and performing model inference.

Parameters:
-----------
model : str
    The name of the model to be used.
max_new_tokens : int, optional
    The maximum number of new tokens to be generated (default is 20).
temperature : float, optional
    The temperature for text generation (default is 0).
model_dir : str or None, optional
    The directory containing the model files (default is None).
system_prompt : str or None, optional
    The system prompt to be used (default is None).
openai_key : str or None, optional
    The OpenAI API key, if required (default is None).
sleep_time : int, optional
    The sleep time between inference calls (default is 3).
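
To illustrate the report, a short sketch based on the docstring above (the model name and key are placeholders): pass temperature explicitly as a float literal such as 0.0 rather than the int 0.

import promptbench as pb

# Pass temperature as a float; an int reportedly triggers the type issue described here.
model = pb.LLMModel(model="gpt-3.5-turbo", max_new_tokens=20,
                    temperature=0.0, openai_key="YOUR_OPENAI_KEY")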

typo

Check https://llm-eval.github.io/pages/leaderboard/dyval.html

Chatgpt

I want to replicate the results in the paper using the ChatGPT model, but the predict_by_openai_api method raises a NotImplementedError.

ChatGPT version

Could you please confirm the exact ChatGPT version you used in the paper? Thanks.

Compatibility Request: Upgrade openai Dependency to Support >=1.10.0

Environment

  • promptbench version: 0.0.2
  • langchain-openai version: 0.1.1
  • openai version requested: 1.14.3
  • Python version: [Your Python version]
  • Operating System: [Your OS]

Description

I am currently working on a project that relies on promptbench together with langchain-openai, and I've encountered a dependency conflict with the openai package: promptbench requires openai version 1.3.7, while langchain-openai is compatible with versions at least 1.10.0 and below 2.0.0. For my use case, I need features that are only available in openai version 1.10.0 or later.

Issue

The current version constraint of promptbench for the openai package prevents me from using the newer features of the openai library that are necessary for my project. This version conflict is causing installation and compatibility issues in my development environment.

Request

Would it be possible to update the promptbench dependency on the openai library to be compatible with version 1.10.0 or later? This update would greatly benefit users who need the newer features from the openai library.

About Semantic Attacks Against Vicuna

Hi!

Thanks for your nice work!

One small question: when I visited your demo site and chose "Vicuna" + "MNLI" + "Semantic" + "zero-shot task", the site returned nothing. I am a bit confused; could you please supply these adversarial prompts? Many thanks!

Best wishes

pb-issue

How to use visualize.py with llama-2-7b?

Hi, nice work!

I would like to know how to use visualize.py with llama-2-7b. I see that the code visualizes by calculating gradients rather than using the attention matrix. So for llama-2-7b, if I want to observe which words in the input sentence the model weights more heavily, what should I do? Do I need ground-truth labels? Can this be achieved by modifying the visualize.py code? Could you please give me some advice? Thank you very much!
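
For what it's worth, a generic sketch of gradient-based token saliency for a causal LM (this is not promptbench's visualize.py; the model identifier is illustrative and assumes you have access to the weights). It uses the model's own next-token loss, so no gold label is required:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

text = "Premise: He hadn't seen even pictures of such things. Hypothesis: He had recently seen such pictures."
inputs = tokenizer(text, return_tensors="pt")

# Embed the input ids and make the embeddings a leaf tensor that records gradients.
embeds = model.get_input_embeddings()(inputs["input_ids"]).detach().requires_grad_(True)

# Differentiate the language-modeling loss (next-token prediction on the input itself).
outputs = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"],
                labels=inputs["input_ids"])
outputs.loss.backward()

# Saliency per token: L2 norm of the gradient of the loss w.r.t. that token's embedding.
saliency = embeds.grad.norm(dim=-1).squeeze(0)
for tok, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()),
                      saliency.tolist()):
    print(f"{tok}\t{score:.4f}")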
