potsawee / selfcheckgpt
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
License: MIT License
Hi @potsawee, thanks for sharing your nice work.
There is something unclear in your paper; I wonder if you could explain it?
After we obtain the three scores, how do we combine them into the final score? The three scores are on different scales.
I noticed that in section 5.4, you wrote "As a result, we consider SelfCheckGPT-Combination, which is a simple combination of the normalized scores of the three variants...."
So what is the combination strategy?
Thanks in advance! Looking forward to your reply.
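For context while waiting on an answer: one plausible reading of "simple combination of the normalized scores" is per-variant min-max normalization to [0, 1] followed by averaging. This is an assumption, not necessarily the paper's exact recipe:

```python
def normalize(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def combine(bertscore, qa, ngram):
    """Hypothetical recipe: average the per-variant normalized scores."""
    return [(b + q + n) / 3.0
            for b, q, n in zip(normalize(bertscore),
                               normalize(qa),
                               normalize(ngram))]

# Toy example: three variants scoring the same three sentences on different scales
combined = combine([0.2, 0.8, 0.5], [1.0, 3.0, 2.0], [-1.2, -0.3, -0.9])
```

Since all three variants are oriented so that higher means less supported, averaging the normalized scores keeps that orientation.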
Hello,
Your repo has no Discussions tab, so I am opening this question as an issue; I hope that is fine.
I found this package while looking into a method to check if the results I am getting from my LLM (google PaLM) are reliable. The questions I am asking are very specific, and I suspect many answers will be laced with hallucinations, which is understandable considering the training set and the narrow knowledge domain I am accessing with my question.
The questions I am asking are about educational institutes teaching a specific topic, by country, in pseudo-code:
As you can see, this is a very specific line of questioning, and since we don't know whether the model has seen documents matching these parameters, I am looking for a way to verify that the answer to the third step is correct: when it is false, the list of modules is also false. Do you think your package can help me with this, or do you know of a more appropriate method for checking the results?
Thanks in advance for looking into this!
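In the meantime, a crude stand-in for what SelfCheckGPT measures is answer self-consistency: resample the model several times for the same question and check agreement. The sketch below uses exact string matching and hypothetical institute names, purely for illustration; the package replaces the matching with learned scorers:

```python
def consistency_score(main_answer, sampled_answers):
    """Fraction of resampled answers that agree with the main answer.
    Low agreement suggests the model may be hallucinating. Exact string
    matching is a crude stand-in for SelfCheckGPT's learned scorers."""
    if not sampled_answers:
        return 0.0
    matches = sum(ans.strip().lower() == main_answer.strip().lower()
                  for ans in sampled_answers)
    return matches / len(sampled_answers)

# Hypothetical institute names, for illustration only
score = consistency_score(
    "University of Examplestan",
    ["university of examplestan",
     "University of Examplestan",
     "Institute of Otherplace"],
)
```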
Hello,
I am trying to use your work to estimate the factuality of samples, but I am getting relatively low scores from selfcheck_bertscore even when the samples are totally contradictory. I was wondering how you chose whether a passage is factual or non-factual.
Thank you
Hello! Could you please upload your code for the LLaMA proxy-model results? Many thanks!
Thank you for the interesting paper, and for providing your models and data!
Do you have the code you used to produce the evaluation tables and plots in the paper, and would you be open to including it in this repo? That would help me a lot with an experiment I'm running.
Context:
I have been trying to reproduce the results from the paper, specifically the probability-based baselines for sentence-level factuality.
I'm using text-davinci-003 for logprobs, splitting the tokens into the provided sentences, and computing the average and minimum logprob over each sentence. For each of {average logprob, minimum logprob}, I compute it in two ways:
I'm seeing very little relationship between any of these metrics and the annotations. My precision-recall curves look flat, with precision roughly equal to the base rate as recall ranges from 0 to 1 -- i.e., like random guessing.
It's possible (likely?) that I'm doing something wrong. I'll keep squinting at my code, but in the meantime, I thought I'd ask you guys for the code you used.
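Concretely, the per-sentence reduction I am computing looks like this (toy numbers; the real token logprobs and sentence spans come from the API response):

```python
def sentence_logprob_scores(token_logprobs, sentence_spans):
    """Average and minimum token logprob per sentence.
    token_logprobs: per-token logprobs for the whole passage.
    sentence_spans: (start, end) token-index pairs, one per sentence."""
    avg_lp, min_lp = [], []
    for start, end in sentence_spans:
        lps = token_logprobs[start:end]
        avg_lp.append(sum(lps) / len(lps))
        min_lp.append(min(lps))
    return avg_lp, min_lp

# Toy passage: two sentences covering tokens [0:3] and [3:5]
avg_lp, min_lp = sentence_logprob_scores(
    [-0.1, -2.0, -0.3, -0.5, -0.05], [(0, 3), (3, 5)]
)
```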
Another question: what is meant by Non-Factual* (with the *) in the paper? It appears in all the evals, but I can't find any explanatory text about it.
Hi all, this may be a dumb question: what do the two numbers in the example result mean? Are they related to the length of sampled_passages? Is the result always going to be two numbers?
Hi there, awesome stuff, thanks for sharing.
Question: Are there possible annotation errors in the eval dataset (wiki_bio_gpt3_hallucination)?
Example: Observing example 6, index 11
Example: 6
array(['Akila Dananjaya (born 2 August 1995) is a Sri Lankan cricketer.',
'He made his international debut for the Sri Lankan cricket team in August 2018.',
'He is a right-arm off-spinner and right-handed batsman.',
'Dananjaya made his first-class debut for Sri Lanka Army Sports Club in the 2013–14 Premier League Tournament.',
'He was the leading wicket-taker in the tournament, taking 32 wickets in seven matches.',
'He made his List A debut for Sri Lanka Army Sports Club in the 2014–15 Premier Limited Overs Tournament.',
'In August 2018, he was named in the Sri Lankan squad for the 2018 Asia Cup.',
'He made his One Day International (ODI) debut for Sri Lanka against Bangladesh on 15 September 2018.',
"In October 2018, he was named in Sri Lanka's Test squad for their series against England, but he did not play.",
"In December 2018, he was named in Sri Lanka's team for the 2018 ACC Emerging Teams Asia Cup.",
'He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss'],
dtype=object)
index =11
He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss
Ground Truth: major_inaccurate
Should be: accurate
?
Observation:
In a brief analysis, only NLI scores aggregated at the sentence level seem to agree with the ground truth.
The code is provided below.
from datasets import load_dataset
import spacy
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device)

dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")
eval_df = dataset['evaluation'].to_pandas()
eval_df.head()

# selecting idx=5 (example 6)
_idx = 5
sentences = eval_df['gpt3_sentences'][_idx]
context = eval_df['gpt3_text'][_idx]
gt_labels = eval_df['annotation'][_idx]
print(f"labels:\n{gt_labels}")
print(f"Label for index 11: {gt_labels[10]}")

print("Context:\n")
print(context)
print("\n\n")
print(f"**Eval sentence:**\n {sentences[10]}")
print("\n\n")

# Context broken into sentences
nlp = spacy.load("en_core_web_md")
context_samples = [sent.text.strip() for sent in nlp(context).sents]

# Whole passage as the sampled passage:
nli_scores = selfcheck_nli.predict(
    sentences = [sentences[10]],
    sampled_passages = [context]
)
print(f"Passage (low score as expected): {nli_scores}")

# Individual context sentences as the sampled passages:
nli_scores = selfcheck_nli.predict(
    sentences = [sentences[10]],
    sampled_passages = context_samples
)
print(f"Avg score on sentences is high: {nli_scores}")

# Sanity check against the exactly matching substring:
specific_exp = """
In December 2018, he was named in Sri Lanka's team for the 2018 ACC Emerging Teams Asia Cup. He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss
"""
nli_scores_specific = selfcheck_nli.predict(
    sentences = [sentences[10]],
    sampled_passages = [specific_exp]
)
print(f"Does it even work - comparing with the matched string (low score as expected)?: {nli_scores_specific}")
print("---- Using LLM - mistralai/Mistral-7B-Instruct-v0.2 ----") # quick eval purpose
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# alternative: mistralai/Mixtral-8x7B-Instruct-v0.
llm_model = "mistralai/Mistral-7B-Instruct-v0.2"
# Basic prompt:
# "Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer: "
selfcheck_prompt = SelfCheckLLMPrompt(llm_model, device)

# Whole passage as the sampled passage:
prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],
    sampled_passages = [context],
    verbose = True
)
print(f"Passage (low score as expected): {prompt_scores}")

# Individual context sentences as the sampled passages:
prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],
    sampled_passages = context_samples,
    verbose = True
)
print(f"Avg score on sentences (better than NLI): {prompt_scores}")

# Sanity check against the exactly matching substring:
prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],
    sampled_passages = [specific_exp],
    verbose = True
)
print(f"Does it even work - comparing with the matched string (low score as expected)?: {prompt_scores}")
Thoughts?
Code for SelfCheck n-gram to be included in this repository
Hi @potsawee, thank you for the impressive method and the easy-to-use dataset. Two quick questions about the SelfCheckGPT-NLI and SelfCheckGPT-Prompt methods, which achieve quite impressive performance.
Regarding the SelfCheckGPT-NLI method, I noticed that you use the "potsawee/deberta-v3-large-mnli" NLI model. Since I had been using other NLI models in previous work, I replaced it with "microsoft/deberta-large-mnli" and "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli". I also modified the contradiction probabilities following your suggestion, removing the probability of the "neutral" class. However, the performance of these two NLI models seems a little weak. I tested AUC-PR in the "NonFact" setting on the first 20 passages in the dataset (around 400 sentences), and here is the performance:
Regarding the SelfCheckGPT-Prompt method, I see you updated the code a few hours ago. The current code seems to support only open-source models, not the gpt-3.5-turbo/text-davinci-003 used in the main paper (Table 2). Is there an estimated timeline for releasing the SelfCheckGPT-Prompt code with gpt-3.5-turbo/text-davinci-003?
Thank you so much for your impressive work and effort!
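For reference, this is how I interpreted "removing the probability of the neutral class" when computing the contradiction score: renormalizing over the entailment and contradiction logits only. This is my assumption, and label order varies between NLI checkpoints, so the indices below need checking against each model's id2label config:

```python
import math

def contradiction_prob(logits, entail_idx=0, contra_idx=2):
    """P(contradiction) from 3-class NLI logits with 'neutral' dropped:
    softmax over the entailment and contradiction logits only.
    Check each checkpoint's id2label before trusting the indices."""
    e = math.exp(logits[entail_idx])
    c = math.exp(logits[contra_idx])
    return c / (e + c)

p_entailed = contradiction_prob([2.0, 0.5, -1.0])   # low: sentence supported
p_contra = contradiction_prob([-1.0, 0.5, 2.0])     # high: sentence contradicted
```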
What do you think? I think this could be a useful test metric. However, it is hard to use in a real-time API service (while checking for hallucination, I cannot respond to the client), since it requires additional hardware resources and adds response time.
Could you show us the human annotation scores for each passage (the ones used to calculate the Pearson and Spearman correlations)?
Hi @potsawee , thanks for sharing your awesome work.
However, when trying to run your code, I found that although there is an n-gram model, no examples of its usage are provided. The n-gram model is quite different from the others, since its score range is not bounded to [0, 1] like theirs.
I wonder if my understanding is wrong. Could you please share a better way of normalizing, point out my mistake, and share an example of evaluating the hallucination score with the n-gram method? Thanks.
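For anyone with the same question: one monotonic squashing that maps the unbounded n-gram score (an average negative log-probability) into [0, 1) is shown below. It is only my guess at a reasonable normalizer; because it is monotonic, it leaves sentence rankings, and therefore AUC-PR, unchanged:

```python
import math

def squash_ngram_score(avg_neg_logprob):
    """Map the unbounded n-gram score (an average negative log-probability,
    in [0, inf)) into [0, 1) via 1 - exp(-x). Monotonic, so rankings and
    AUC-PR are unchanged; the choice of squashing function is a guess."""
    return 1.0 - math.exp(-avg_neg_logprob)

squash_ngram_score(0.0)  # 0.0 for a fully expected sentence
```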
Hi there! Thanks for your awesome work.
Just like what I said in the title: could you provide an example showing the AUC-PR of your own methods (i.e., SelfCheckGPT), just like the probability-based baselines?
Same as the title. Thanks in advance.
Hello,
Which version of the wikibio dataset did you use?
I can't find the wiki_bio_test_idx indices in the wikipedia-biography-dataset/test/test.id file here: https://huggingface.co/datasets/wiki_bio/blob/main/data/wikipedia-biography-dataset.zip
Without the names that were used, it is hard to carry out the experiment on other models.
By names, I mean the {concept} in the prompt "This is a Wikipedia passage about {concept}".
I can't find it in the dataset.
Hi Potsawee,
Thank you for the nice open-source repository and clear code! I have a question about the random baseline you used for the AUC measure: in the notebook, you simply computed the average of the gold labels and took that as the random baseline, but I don't understand the reasoning behind it. What exactly do you mean by a random baseline: the AUC value we would get by guessing scores uniformly at random and comparing them to the gold labels? What does it mean exactly?
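My current guess, which I would love to have confirmed: a scorer that is independent of the labels has precision roughly equal to the positive base rate at every recall, so the AUC-PR of random guessing is just the average of the gold labels. A quick pure-Python simulation:

```python
import random

random.seed(0)
n = 200_000
base_rate = 0.3
labels = [1 if random.random() < base_rate else 0 for _ in range(n)]
scores = [random.random() for _ in range(n)]       # uninformative scorer

ranked = sorted(zip(scores, labels), reverse=True)  # rank by score
precisions = {}
for k in (1_000, 10_000, 100_000):
    precisions[k] = sum(lab for _, lab in ranked[:k]) / k
# precision hovers near base_rate at every cutoff, i.e. a flat PR curve,
# so the AUC-PR of a random ranker is just the positive rate
```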
Thank you and have a nice day!
Hi there, I tried to run the SelfCheckGPT usage example (BERTScore, QA, n-gram) from the README and it took 24 minutes. Is that expected?
Really great work! One suggestion: the current demo notebook assesses hallucination of a response given a set of samples supplied by the user. But the actual aim of the paper, if I understood correctly, is to assess whether the response from an LLM is hallucinated. I think it would be helpful to have a notebook/module that, given an LLM and a query, applies the method from the paper end to end: first obtain the response R with temperature=0, then generate N samples {S1, S2, ..., SN} from the same LLM with temperature=1 and a sampling technique, and finally assess the factuality of R with respect to {S1, S2, ..., SN}.
My idea is to let the user provide the query and the Hugging Face LLM as input, and the system does all the work behind the scenes, returning the response R together with a hallucination score for the chosen method.
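A sketch of such a wrapper, with the model call abstracted behind a user-supplied callable (generate_fn and scorer_fn below are hypothetical names; scorer_fn would be one of the package's predict methods):

```python
def check_response(generate_fn, scorer_fn, query, n_samples=5):
    """End-to-end sketch: one deterministic response plus N stochastic
    samples from the same model, then score the response's sentences
    against the samples.
    generate_fn(query, temperature, do_sample) -> str  (user-supplied LLM call)
    scorer_fn(sentences, sampled_passages) -> list of per-sentence scores
    """
    response = generate_fn(query, temperature=0.0, do_sample=False)
    samples = [generate_fn(query, temperature=1.0, do_sample=True)
               for _ in range(n_samples)]
    # naive sentence split; spaCy would be more robust
    sentences = [s.strip() for s in response.split(". ") if s.strip()]
    scores = scorer_fn(sentences=sentences, sampled_passages=samples)
    return response, scores
```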
Hi @potsawee, great paper and thanks for the easy to use library! One quick question, for the NLI results you show in this repo, what is the probability threshold you are using to determine factual vs. non-factual?
Thanks!
Good job! I also wonder whether you have tried other datasets to evaluate the proposed method. Or are there other datasets like wiki_bio_gpt3_hallucination that I can test on? Thanks!