potsawee / selfcheckgpt
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
License: MIT License
Hi @potsawee, thanks for sharing your nice work.
There is something unclear in your paper; I wonder if you could explain it?
After we obtain the three scores, how do we combine them into the final score? The three scores are on different scales.
I noticed that in section 5.4, you wrote "As a result, we consider SelfCheckGPT-Combination, which is a simple combination of the normalized scores of the three variants...."
So what is the combination strategy?
Thanks in advance! Looking forward to your reply.
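For context while waiting on an answer: one plausible reading of "simple combination of the normalized scores" is per-variant min-max normalization to [0, 1] followed by averaging. This is an assumption, not necessarily the paper's exact recipe:

```python
def normalize(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def combine(bertscore, qa, ngram):
    """Hypothetical recipe: average the per-variant normalized scores."""
    return [(b + q + n) / 3.0
            for b, q, n in zip(normalize(bertscore),
                               normalize(qa),
                               normalize(ngram))]

# Toy example: three variants scoring the same three sentences on different scales
combined = combine([0.2, 0.8, 0.5], [1.0, 3.0, 2.0], [-1.2, -0.3, -0.9])
```

Since all three variants are oriented so that higher means less supported, averaging the normalized scores keeps that orientation.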
Hello,
Your repo has no Discussions tab, so I am opening this question as an issue; I hope that is fine.
I found this package while looking into a method to check if the results I am getting from my LLM (google PaLM) are reliable. The questions I am asking are very specific, and I suspect many answers will be laced with hallucinations, which is understandable considering the training set and the narrow knowledge domain I am accessing with my question.
The questions I am asking are about educational institutes teaching a specific topic, by country, in pseudo-code:
As you can see, this is a very specific line of questioning, and since we don't know whether the model has seen documents matching these parameters, I am looking for a way to verify that the answer to the third step is correct: when it is false, the list of modules is also false. Do you think your package can help me with this, or do you know of a more appropriate method for checking the results?
Thanks in advance for looking into this!
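In the meantime, a crude stand-in for what SelfCheckGPT measures is answer self-consistency: resample the model several times for the same question and check agreement. The sketch below uses exact string matching and hypothetical institute names, purely for illustration; the package replaces the matching with learned scorers:

```python
def consistency_score(main_answer, sampled_answers):
    """Fraction of resampled answers that agree with the main answer.
    Low agreement suggests the model may be hallucinating. Exact string
    matching is a crude stand-in for SelfCheckGPT's learned scorers."""
    if not sampled_answers:
        return 0.0
    matches = sum(ans.strip().lower() == main_answer.strip().lower()
                  for ans in sampled_answers)
    return matches / len(sampled_answers)

# Hypothetical institute names, for illustration only
score = consistency_score(
    "University of Examplestan",
    ["university of examplestan",
     "University of Examplestan",
     "Institute of Otherplace"],
)
```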
Hello,
I am trying to use your work to estimate the factuality of samples, but I am getting relatively low scores from selfcheck_bertscore even when the samples are totally contradictory. I was wondering how you chose whether a passage is factual or non-factual.
Thank you
Hello! Could you please upload your code for the LLaMA proxy-model results? Many thanks!
Thank you for the interesting paper, and for providing your models and data!
Do you have the code you used to produce the evaluation tables and plots in the paper, and would you be open to including it in this repo? That would help me a lot with an experiment I'm running.
Context:
I have been trying to reproduce the results from the paper, specifically the probability-based baselines for sentence-level factuality.
I'm using text-davinci-003 for logprobs, splitting the tokens into the provided sentences, and computing the average and minimum logprob over each sentence. For each of {average logprob, minimum logprob}, I compute it in two ways:
I'm seeing very little relationship between any of these metrics and the annotations. My precision-recall curves look flat, with precision roughly equal to the base rate as recall ranges from 0 to 1 -- i.e., like random guessing.
It's possible (likely?) that I'm doing something wrong. I'll keep squinting at my code, but in the meantime, I thought I'd ask you guys for the code you used.
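Concretely, the per-sentence reduction I am computing looks like this (toy numbers; the real token logprobs and sentence spans come from the API response):

```python
def sentence_logprob_scores(token_logprobs, sentence_spans):
    """Average and minimum token logprob per sentence.
    token_logprobs: per-token logprobs for the whole passage.
    sentence_spans: (start, end) token-index pairs, one per sentence."""
    avg_lp, min_lp = [], []
    for start, end in sentence_spans:
        lps = token_logprobs[start:end]
        avg_lp.append(sum(lps) / len(lps))
        min_lp.append(min(lps))
    return avg_lp, min_lp

# Toy passage: two sentences covering tokens [0:3] and [3:5]
avg_lp, min_lp = sentence_logprob_scores(
    [-0.1, -2.0, -0.3, -0.5, -0.05], [(0, 3), (3, 5)]
)
```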
Another question: what is meant by Non-Factual* (with the *) in the paper? It appears in all the evals, but I can't find any explanatory text about it.
Hi all, this may be a dumb question: what do the two numbers in the example result mean? Are they related to the length of sampled_passages? Is the result always going to be two numbers?
Hi there, awesome stuff, thanks for sharing.
Question: Are there possible annotation errors in the eval dataset (wiki_bio_gpt3_hallucination)?
Example: Observing example 6, index 11
Example: 6
array(['Akila Dananjaya (born 2 August 1995) is a Sri Lankan cricketer.',
'He made his international debut for the Sri Lankan cricket team in August 2018.',
'He is a right-arm off-spinner and right-handed batsman.',
'Dananjaya made his first-class debut for Sri Lanka Army Sports Club in the 2013–14 Premier League Tournament.',
'He was the leading wicket-taker in the tournament, taking 32 wickets in seven matches.',
'He made his List A debut for Sri Lanka Army Sports Club in the 2014–15 Premier Limited Overs Tournament.',
'In August 2018, he was named in the Sri Lankan squad for the 2018 Asia Cup.',
'He made his One Day International (ODI) debut for Sri Lanka against Bangladesh on 15 September 2018.',
"In October 2018, he was named in Sri Lanka's Test squad for their series against England, but he did not play.",
"In December 2018, he was named in Sri Lanka's team for the 2018 ACC Emerging Teams Asia Cup.",
'He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss'],
dtype=object)
index =11
He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss
Ground Truth: major_inaccurate
Should be: accurate
?
Observation:
In a brief analysis, only NLI scores aggregated at the sentence level seem to agree with the ground truth.
The code is provided below.
from datasets import load_dataset
import spacy
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device)

dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")
eval_df = dataset['evaluation'].to_pandas()
eval_df.head()

# selecting idx=5 (example 6)
_idx = 5
sentences = eval_df['gpt3_sentences'][_idx]
context = eval_df['gpt3_text'][_idx]
gt_labels = eval_df['annotation'][_idx]
print(f"labels:\n{gt_labels}")
print(f"Label for index 11: {gt_labels[10]}")

print("Context:\n")
print(context)
print("\n\n")
print(f"**Eval sentence:**\n {sentences[10]}")
print("\n\n")

# Context broken into sentences
nlp = spacy.load("en_core_web_md")
context_samples = [sent.text.strip() for sent in nlp(context).sents]

# Whole passage as the sampled passage:
nli_scores = selfcheck_nli.predict(
    sentences = [sentences[10]],
    sampled_passages = [context]
)
print(f"Passage (low score as expected): {nli_scores}")

# Individual context sentences as the sampled passages:
nli_scores = selfcheck_nli.predict(
    sentences = [sentences[10]],
    sampled_passages = context_samples
)
print(f"Avg score on sentences is high: {nli_scores}")

# Sanity check against the exactly matching substring:
specific_exp = """
In December 2018, he was named in Sri Lanka's team for the 2018 ACC Emerging Teams Asia Cup. He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss
"""
nli_scores_specific = selfcheck_nli.predict(
    sentences = [sentences[10]],
    sampled_passages = [specific_exp]
)
print(f"Does it even work - comparing with the matched string (low score as expected)?: {nli_scores_specific}")
print("---- Using LLM - mistralai/Mistral-7B-Instruct-v0.2 ----") # quick eval purpose
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# alternative: mistralai/Mixtral-8x7B-Instruct-v0.
llm_model = "mistralai/Mistral-7B-Instruct-v0.2"
# Basic prompt:
# "Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer: "
selfcheck_prompt = SelfCheckLLMPrompt(llm_model, device)

# Whole passage as the sampled passage:
prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],
    sampled_passages = [context],
    verbose = True
)
print(f"Passage (low score as expected): {prompt_scores}")

# Individual context sentences as the sampled passages:
prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],
    sampled_passages = context_samples,
    verbose = True
)
print(f"Avg score on sentences (better than NLI): {prompt_scores}")

# Sanity check against the exactly matching substring:
prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],
    sampled_passages = [specific_exp],
    verbose = True
)
print(f"Does it even work - comparing with the matched string (low score as expected)?: {prompt_scores}")
Thoughts?
Code for SelfCheck n-gram to be included in this repository
Hi @potsawee, thank you for the impressive method and the easy-to-use dataset. Two quick questions about the SelfCheckGPT-NLI and SelfCheckGPT-Prompt methods, which achieve quite impressive performance.
Regarding the SelfCheckGPT-NLI method, I noticed that you use the "potsawee/deberta-v3-large-mnli" NLI model. Since I had been using other NLI models in previous work, I replaced it with "microsoft/deberta-large-mnli" and "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli". I also modified the contradiction probabilities following your suggestion, removing the probability of the "neutral" class. However, the performance of these two NLI models seems a little weak. I tested AUC-PR in the "NonFact" setting on the first 20 passages in the dataset (around 400 sentences), and here is the performance:
Regarding the SelfCheckGPT-Prompt method, I see you updated the code a few hours ago. The current code seems to support only open-source models, not the gpt-3.5-turbo/text-davinci-003 used in the main paper (Table 2). Is there an estimated timeline for releasing the SelfCheckGPT-Prompt code with gpt-3.5-turbo/text-davinci-003?
Thank you so much for your impressive work and effort!
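For reference, this is how I interpreted "removing the probability of the neutral class" when computing the contradiction score: renormalizing over the entailment and contradiction logits only. This is my assumption, and label order varies between NLI checkpoints, so the indices below need checking against each model's id2label config:

```python
import math

def contradiction_prob(logits, entail_idx=0, contra_idx=2):
    """P(contradiction) from 3-class NLI logits with 'neutral' dropped:
    softmax over the entailment and contradiction logits only.
    Check each checkpoint's id2label before trusting the indices."""
    e = math.exp(logits[entail_idx])
    c = math.exp(logits[contra_idx])
    return c / (e + c)

p_entailed = contradiction_prob([2.0, 0.5, -1.0])   # low: sentence supported
p_contra = contradiction_prob([-1.0, 0.5, 2.0])     # high: sentence contradicted
```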
What do you think? I think this could be a useful test metric. However, it is hard to use in a real-time API service (while checking for hallucination, I cannot respond to the client), since it requires additional hardware resources and adds response time.
Could you show us the human annotation scores for each passage (the ones used to calculate the Pearson and Spearman correlations)?
Hi @potsawee , thanks for sharing your awesome work.
However, when trying to run your code, I found that although there is an n-gram model, no examples of its usage are provided. The n-gram model is quite different from the others, since its score range is not bounded to [0, 1] like theirs.
I wonder if my understanding is wrong. Could you please share a better way of normalizing, point out my mistake, and share an example of evaluating the hallucination score with the n-gram method? Thanks.
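For anyone with the same question: one monotonic squashing that maps the unbounded n-gram score (an average negative log-probability) into [0, 1) is shown below. It is only my guess at a reasonable normalizer; because it is monotonic, it leaves sentence rankings, and therefore AUC-PR, unchanged:

```python
import math

def squash_ngram_score(avg_neg_logprob):
    """Map the unbounded n-gram score (an average negative log-probability,
    in [0, inf)) into [0, 1) via 1 - exp(-x). Monotonic, so rankings and
    AUC-PR are unchanged; the choice of squashing function is a guess."""
    return 1.0 - math.exp(-avg_neg_logprob)

squash_ngram_score(0.0)  # 0.0 for a fully expected sentence
```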
Hi there! Thanks for your awesome work.
Just like what I said in the title: could you provide an example showing the AUC-PR of your own methods (i.e., SelfCheckGPT), just like the probability-based baselines?
Same as the title. Thanks in advance.
Hello,
Which version of the wikibio dataset did you use?
I can't find the wiki_bio_test_idx indices in the wikipedia-biography-dataset/test/test.id file here: https://huggingface.co/datasets/wiki_bio/blob/main/data/wikipedia-biography-dataset.zip
Without the names that were used, it is hard to carry out the experiment on other models.
By names, I mean the {concept} in the prompt "This is a Wikipedia passage about {concept}".
I can't find it in the dataset.
Hi Potsawee,
Thank you for the nice open-source repository and clear code! I have a question about the random baseline you used for the AUC measure: in the notebook, you simply computed the average of the gold labels and took that as the random baseline, but I don't understand the reasoning behind it. What exactly do you mean by a random baseline: the AUC value we would get by guessing scores uniformly at random and comparing them to the gold labels? What does it mean exactly?
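My current guess, which I would love to have confirmed: a scorer that is independent of the labels has precision roughly equal to the positive base rate at every recall, so the AUC-PR of random guessing is just the average of the gold labels. A quick pure-Python simulation:

```python
import random

random.seed(0)
n = 200_000
base_rate = 0.3
labels = [1 if random.random() < base_rate else 0 for _ in range(n)]
scores = [random.random() for _ in range(n)]       # uninformative scorer

ranked = sorted(zip(scores, labels), reverse=True)  # rank by score
precisions = {}
for k in (1_000, 10_000, 100_000):
    precisions[k] = sum(lab for _, lab in ranked[:k]) / k
# precision hovers near base_rate at every cutoff, i.e. a flat PR curve,
# so the AUC-PR of a random ranker is just the positive rate
```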
Thank you and have a nice day!
Hi there, I tried to run the SelfCheckGPT usage example (BERTScore, QA, n-gram) from the README and it took 24 minutes. Is that expected?
Really great work! One suggestion: the current demo notebook assesses hallucination of a response given a set of samples supplied by the user. But the actual aim of the paper, if I understood correctly, is to assess whether the response from an LLM is hallucinated. I think it would be helpful to have a notebook/module that, given an LLM and a query, applies the method from the paper end to end: first obtain the response R with temperature=0, then generate N samples {S1, S2, ..., SN} from the same LLM with temperature=1 and a sampling technique, and finally assess the factuality of R with respect to {S1, S2, ..., SN}.
My idea is to let the user provide the query and the Hugging Face LLM as input, and the system does all the work behind the scenes, returning the response R together with a hallucination score for the chosen method.
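A sketch of such a wrapper, with the model call abstracted behind a user-supplied callable (generate_fn and scorer_fn below are hypothetical names; scorer_fn would be one of the package's predict methods):

```python
def check_response(generate_fn, scorer_fn, query, n_samples=5):
    """End-to-end sketch: one deterministic response plus N stochastic
    samples from the same model, then score the response's sentences
    against the samples.
    generate_fn(query, temperature, do_sample) -> str  (user-supplied LLM call)
    scorer_fn(sentences, sampled_passages) -> list of per-sentence scores
    """
    response = generate_fn(query, temperature=0.0, do_sample=False)
    samples = [generate_fn(query, temperature=1.0, do_sample=True)
               for _ in range(n_samples)]
    # naive sentence split; spaCy would be more robust
    sentences = [s.strip() for s in response.split(". ") if s.strip()]
    scores = scorer_fn(sentences=sentences, sampled_passages=samples)
    return response, scores
```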
Hi @potsawee, great paper and thanks for the easy to use library! One quick question, for the NLI results you show in this repo, what is the probability threshold you are using to determine factual vs. non-factual?
Thanks!
Good job! I also wonder whether you have tried other datasets to evaluate the proposed method. Or are there other datasets like wiki_bio_gpt3_hallucination that I can test on? Thanks!