
spuq's Introduction

Project Description

SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models

SPUQ is an uncertainty-calibration algorithm for LLMs. Given an LLM, it assigns a confidence score to each query. Experiments show that this confidence score correlates with generation accuracy, so it provides a useful on-the-fly metric for evaluating LLM responses.

The details of the approach are documented in our paper, published at the EACL 2024 conference.

The basic idea is to check whether an LLM gives a significantly different answer when we ask the same question in a slightly different way. If it does, we assume the LLM is not confident about this query. SPUQ perturbs the input (including the prompt and the temperature) to get multiple outputs, and then aggregates the outputs into a final confidence score. This allows SPUQ to address both epistemic (via perturbation) and aleatoric (via sampling) uncertainties, and it provides better calibration than several existing methods.
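The perturb-then-aggregate loop can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the `llm` and `perturb` callables are stand-ins you would supply, and the token-overlap similarity is a crude placeholder for the real aggregation metrics.

```python
import itertools

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between token sets -- a crude stand-in for rougeL."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def spuq_confidence(llm, perturb, question: str, n_perturb: int = 5) -> float:
    """Perturb the question, sample an answer per variant, and score
    confidence as the mean pairwise similarity of the answers."""
    variants = [question] + [perturb(question) for _ in range(n_perturb)]
    outputs = [llm(v) for v in variants]
    pairs = list(itertools.combinations(outputs, 2))
    return sum(token_overlap(a, b) for a, b in pairs) / len(pairs)

# Toy demo with stub components (not real models):
consistent_llm = lambda q: "yes 100 is greater than 3"
identity_perturb = lambda q: q + " really"
print(spuq_confidence(consistent_llm, identity_perturb, "Is 100 greater than 3?"))
# A perfectly consistent LLM yields confidence 1.0
```

If the answers stay similar under perturbation, the mean pairwise similarity (and hence the confidence) is high; contradictory answers drive it down.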

Example

Here's an example to set up SPUQ. You can play with it using the notebook. The config includes:

  • The target LLM you'd like to calibrate. We use gpt-35-turbo-v0301 as an example
  • The perturbation method. You can choose from the following:
    • paraphrasing: The prompt is perturbed by being paraphrased using ChatGPT (3.5).
    • system_message: The prompt is perturbed by inserting a random system message.
    • dummy_token: The prompt is perturbed by inserting a random dummy token.
    • temperature: The temperature is perturbed with a random change.
  • The aggregation method, used to measure agreement between the perturbed outputs. The example below uses rougeL.
  • The number of perturbed variants. Usually up to 5.
from spuq import SPUQ
from llms import LLM
llm = LLM('gpt-35-turbo-v0301')
spuq = SPUQ(llm=llm, perturbation='paraphrasing', aggregation='rougeL', n_perturb=5)

A low-confidence example

In the example below, SPUQ detects that the LLM is not certain about the question "Will Jay-Z reach the age of 60 before Kendrick Lamar?". By paraphrasing the question into several versions, SPUQ elicits contradictory outputs from the LLM. As a result, the final confidence score is low (~0.256).

question = 'Will Jay-Z reach the age of 60 before Kendrick Lamar?'
messages = [{'role': 'user', 'content': question}]
spuq.run(messages, temperature=0.7)

The SPUQ report:

{"perturbed": [[[{"role": "user",
     "content": "Is Kendrick Lamar going to turn 60 before Jay-Z?"}],
   0.7],
  [[{"role": "user",
     "content": "Who will turn 60 first, Jay-Z or Kendrick Lamar?"}],
   0.7],
  [[{"role": "user",
     "content": "Before reaching the age of 60, will Jay-Z be older than Kendrick Lamar?"}],
   0.7],
  [[{"role": "user", "content": "Will Kendrick Lamar hit 60 after Jay-Z?"}],
   0.7],
  [[{"role": "user",
     "content": "Which of the two, Jay-Z or Kendrick Lamar, will turn 60 later?"}],
   0.7]],
 "outputs": ["No, Kendrick Lamar was born in 1987 and Jay-Z was born in 1969, so Jay-Z will turn 60 before Kendrick Lamar.",
  "Jay-Z will turn 60 first. He was born on December 4, 1969, while Kendrick Lamar was born on June 17, 1987.",
  "Yes, Jay-Z is already older than Kendrick Lamar. Jay-Z was born on December 4, 1969, while Kendrick Lamar was born on June 17, 1987.",
  "I'm sorry, as an AI language model, I cannot predict the future.",
  "Jay-Z will turn 60 later."],
 "confidence": 0.25582140902337946}

A high-confidence example

In the example below, SPUQ detects that the LLM is much more certain about another question, "Is 100 greater than 3?". Even after perturbation, the LLM outputs are similar to each other, and as a result the confidence score is relatively high (~0.538).

question = 'Is 100 greater than 3?'
messages = [{'role': 'user', 'content': question}]
spuq.run(messages, temperature=0.7)

The SPUQ report:

{"perturbed": [[[{"role": "user", "content": "Does 3 fall short of 100?"}],
   0.7],
  [[{"role": "user", "content": "Is 3 less than 100?"}], 0.7],
  [[{"role": "user", "content": "Is the number 100 bigger than 3?"}], 0.7],
  [[{"role": "user", "content": "Would 100 be considered greater than 3?"}],
   0.7],
  [[{"role": "user", "content": "Does 100 exceed 3 in value?"}], 0.7]],
 "outputs": ["Yes, 3 is much less than 100.",
  "Yes, 3 is less than 100.",
  "Yes, 100 is bigger than 3.",
  "Yes, 100 is greater than 3.",
  "Yes, 100 exceeds 3 in value."],
 "confidence": 0.5384615384615384}
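The rougeL aggregation scores how much the perturbed outputs agree with one another. The repository's requirements suggest it relies on the rouge-score package, and the paper describes a weighted aggregation; the sketch below is a simplified, self-contained illustration (whitespace tokenization, plain average of pairwise ROUGE-L F-measures) of why similar outputs produce a higher score.

```python
def lcs_len(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(ref: str, hyp: str) -> float:
    """ROUGE-L F-measure between two strings (whitespace tokenization)."""
    r, h = ref.lower().split(), hyp.lower().split()
    lcs = lcs_len(r, h)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(h), lcs / len(r)
    return 2 * p * rec / (p + rec)

# A few of the sampled outputs from the high-confidence example above:
outputs = [
    "Yes, 3 is much less than 100.",
    "Yes, 3 is less than 100.",
    "Yes, 100 is bigger than 3.",
]
pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
score = sum(rouge_l_f1(a, b) for a, b in pairs) / len(pairs)
```

Because the outputs overlap heavily, the average pairwise F-measure stays well above what the contradictory Jay-Z answers would produce.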

Usage Steps

Installation

This repo is tested with Python 3.11. Please refer to requirements.txt for details, or install the following packages with pip:

pip install numpy sentence_transformers rouge-score evaluate bert-score openai

Usage

Please see the notebook for details.

Dataset Description

Please read our paper for details.

How to cite

@inproceedings{gao2024spuq,
  title={SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models},
  author={Gao, Xiang and Zhang, Jiaxin and Mouatadid, Lalla and Das, Kamalika},
  booktitle={Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={2336--2346},
  year={2024}
}

spuq's People

Contributors

golsun, asreepathy


spuq's Issues

precision result in demo.ipynb

Thank you for your work!

But I have a question when running demo.ipynb:
I got the results below, and the confidence scores do not make sense to me.
{'perturbed': [([{'role': 'user', 'content': 'Is Jay-Z going to turn 60 before Kendrick Lamar?'}], 0.7), ([{'role': 'user', 'content': 'Will Kendrick Lamar turn 60 after Jay-Z?'}], 0.7), ([{'role': 'user', 'content': 'Is it true that Jay-Z will reach 60 years old before Kendrick Lamar?'}], 0.7), ([{'role': 'user', 'content': 'Will Jay-Z hit the age of 60 before Kendrick Lamar does?'}], 0.7), ([{'role': 'user', 'content': 'Does Kendrick Lamar have to wait longer than Jay-Z to turn 60?'}], 0.7)], 'outputs': ['Yes, Jay-Z will turn 60 before Kendrick Lamar. Jay-Z was born on December 4, 1969, and Kendrick Lamar was born on June 17, 1987.', 'Yes, Kendrick Lamar will turn 60 after Jay-Z. Jay-Z was born on December 4, 1969, and Kendrick Lamar was born on June 17, 1987, which makes him younger than Jay-Z.', "Yes, it's true. Jay-Z was born on December 4, 1969, and Kendrick Lamar was born on June 17, 1987. So, Jay-Z will reach 60 years old before Kendrick Lamar.", 'Yes, Jay-Z will hit the age of 60 before Kendrick Lamar does. Jay-Z was born on December 4, 1969, while Kendrick Lamar was born on June 17, 1987.', 'Yes, Kendrick Lamar has to wait longer than Jay-Z to turn 60 because he is younger. Kendrick Lamar was born in 1987 and Jay-Z was born in 1969.'], 'confidence': 0.65456708691659}

{'perturbed': [([{'role': 'user', 'content': 'Does 3 fall short of 100?'}], 0.7), ([{'role': 'user', 'content': 'Is 3 smaller than 100?'}], 0.7), ([{'role': 'user', 'content': 'Would 100 be considered larger than 3?'}], 0.7), ([{'role': 'user', 'content': 'Does 100 exceed 3?'}], 0.7), ([{'role': 'user', 'content': 'Is it true that 100 is more than 3?'}], 0.7)], 'outputs': ['Yes, 3 falls short of 100 by 97.', 'Yes, 3 is smaller than 100.', 'Yes, 100 would be considered larger than 3.', 'Yes, 100 exceeds 3.', 'Yes, it is true that 100 is more than 3.'], 'confidence': 0.30853174603174605}

The first confidence score is higher than the second, which seems backwards.
Did I make a mistake that causes this?

I hope for your reply. Thank you again for your great effort!

inter-sample confidence

May I ask how the $$w_i = s(x_0, x_i)$$ in Eq. 1 is computed in the code? It seems there is no reweighting in the InterSampleAggregation class.

`requirements.txt` includes local paths?

Thank you for your code!

But I ran into a problem when installing the environment using pip install -r requirements.txt.
I found lines like the ones below:
appnope @ file:///home/conda/feedstock_root/build_artifacts/appnope_1707233003401/work
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work

Are these packages pinned to paths on your local machine?
Can I instead install them directly from standard PyPI, e.g. pip install appnope?

Thank you, and I hope for your kind reply.
