selfcheckgpt's People

Contributors

adianliusie · potsawee · wladimirlct


selfcheckgpt's Issues

How to combine the three variants of selfcheckgpt in your paper?

Hi @potsawee, thanks for sharing your nice work.

There is something unclear in your paper; I wonder if you could give some explanation.

After we obtain the three scores, how do we combine them into a final score? The three scores have different scales:

  • $\mathcal{S}_{BERT} \in [0.0, 1.0]$
  • $\mathcal{S}_{QA} \in [0.0, 1.0]$
  • $\mathcal{S}_{\text{n-gram}} \in [0.0, \infty)$

I noticed that in section 5.4, you wrote "As a result, we consider SelfCheckGPT-Combination, which is a simple combination of the normalized scores of the three variants...."

So what is the combination strategy?
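Not the paper's actual strategy (that is what I'm asking about), but just to make the question concrete, one naive reading of "normalized scores" would be per-passage min-max normalization of each variant followed by an unweighted average, e.g.:

import numpy as np

def minmax(x):
    # scale one variant's sentence scores within a passage onto [0, 1]
    x = np.asarray(x, dtype=float)
    rng = x.max() - x.min()
    return np.zeros_like(x) if rng == 0 else (x - x.min()) / rng

def combine(s_bert, s_qa, s_ngram):
    # hypothetical "simple combination": unweighted mean of the normalized variants
    return (minmax(s_bert) + minmax(s_qa) + minmax(s_ngram)) / 3.0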

Thanks in advance! Looking forward to your reply.

What's the meaning of β in Appendix B?

In Appendix B, Eq. (17), the paper writes:
[image of Eq. (17)]
Later the paper states that "We set both β_1 and β_2 to 0.8."
I was wondering why we can assume $P(a = a_R \mid F) = (1 - \beta_1)$, and why $\beta_1 = 0.8$?

Open Question: Fact Checking LLM

Hello,
Your repo has no Discussions tab, so I am opening an issue to ask a question; I hope that is fine.

I found this package while looking into a method to check whether the results I am getting from my LLM (Google PaLM) are reliable. The questions I am asking are very specific, and I suspect many answers will be laced with hallucinations, which is understandable given the training set and the narrow knowledge domain my questions touch.

The questions I am asking concern educational institutes teaching a specific topic, by country; in pseudo-code:

  1. Loop over all countries
  2. By country, give a list of educational institutes teaching the specific topic
  3. By educational institute, give the (degree) program in which the specific topic is taught
  4. By program, give the modules that are part of this program

As you can see, this is a very specific line of questioning, and because we don't know whether the model has seen documents specific to these parameters, I am looking for a method to verify that the answer in the third step is correct; when it is false, the list of modules is also false. Do you think your package can help me with this, or do you know of another, more appropriate method for checking the results?
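For concreteness, here is a sketch of how the package might slot into step 3, assuming a hypothetical ask_llm() wrapper around the PaLM API (not part of this repo) and the SelfCheckNLI interface shown elsewhere on this page:

import spacy
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

nlp = spacy.load("en_core_web_sm")
selfcheck_nli = SelfCheckNLI(device="cpu")  # constructor arguments assumed from the repo's examples

prompt = "Which degree programs at <institute> teach <topic>?"      # step-3 query
answer = ask_llm(prompt, temperature=0.0)                           # main response R (hypothetical wrapper)
samples = [ask_llm(prompt, temperature=1.0) for _ in range(5)]      # stochastic samples S1..S5

sentences = [s.text.strip() for s in nlp(answer).sents]
scores = selfcheck_nli.predict(sentences=sentences, sampled_passages=samples)
# higher score -> the sentence is less consistent across the samples -> more likely hallucinated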

Thanks in advance for looking into this!

range of selfcheck_bertscore

Hello,
I am trying to use your work to estimate the factuality of samples, but I am getting relatively low selfcheck_bertscore values even when the samples are completely contradictory. I was wondering how you decide whether a passage is factual or non-factual.

Thank you

code for proxy model

Hello! Could you please upload your code for the LLaMA proxy-model results? Many thanks!

Which version of LLaMA model is used?

[image: figure from the paper showing LLaMA results]
As far as I know, the LLaMA model comes in four sizes: 7B, 13B, 33B, and 65B. Which one does the figure refer to? Another question: what type of GPU is used to run llama_logrob_inference.py?

Code to reproduce the paper's evaluations

Thank you for the interesting paper, and for providing your models and data!

Do you have the code you used to produce the evaluation tables and plots in the paper, and would you be open to including it in this repo? That would help me a lot with an experiment I'm running.


Context:

I have been trying to reproduce the results from the paper, specifically the probability-based baselines for sentence-level factuality.

I'm using text-davinci-003 for logprobs, splitting the tokens into the provided sentences, and computing average and minimum logprobs over sentences. For each of {average logprob, minimum logprob}, I compute it in two ways:

  1. Just using the logprob sent back from the API
  2. Computing "probabilities" for the top 5 tokens based on the top 5 logprobs, then taking the log of these -- similar to how PPL5 is computed in the paper (see the sketch after this list)
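To be explicit about what I'm computing, here is roughly the per-sentence calculation (the API field layout here is an assumption, and the exact PPL5 definition in the paper may differ):

import numpy as np

def sentence_metrics(token_logprobs, top5_logprobs):
    # token_logprobs: logprob of each generated token in the sentence
    # top5_logprobs: for each token position, the list of the top-5 logprobs at that position
    lp = np.asarray(token_logprobs, dtype=float)
    avg_logprob, min_logprob = lp.mean(), lp.min()

    # variant 2: renormalize over the top-5 candidates, then take the log of the
    # chosen token's renormalized probability
    renorm = np.array([
        np.log(np.exp(chosen) / np.exp(np.asarray(top5, dtype=float)).sum())
        for chosen, top5 in zip(token_logprobs, top5_logprobs)
    ])
    return avg_logprob, min_logprob, renorm.mean(), renorm.min()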

I'm seeing very little relationship between any of these metrics and the annotations. My precision-recall curves look flat, with precision roughly equal to the base rate as recall ranges from 0 to 1 -- i.e., like random guessing.

It's possible (likely?) that I'm doing something wrong. I'll keep squinting at my code, but in the meantime, I thought I'd ask you guys for the code you used.


Another question: what is meant by Non-Factual* (with the *) in the paper? It appears in all the evals, but I can't find any explanatory text about it.

What do these two numbers mean in the example result?

Hi all, this may be a dumb question, but what do the two numbers in the example result mean? Are they related to the length of sampled_passage? Will the result always be two numbers?

Possible annotation errors?

Hi there, awesome stuff, thanks for sharing.

Question: Are there possible annotation errors in the eval dataset (wiki_bio_gpt3_hallucination)?

Example: example 6 (zero-based idx=5), sentence index 11

Example 6 sentences:

array(['Akila Dananjaya (born 2 August 1995) is a Sri Lankan cricketer.',
       'He made his international debut for the Sri Lankan cricket team in August 2018.',
       'He is a right-arm off-spinner and right-handed batsman.',
       'Dananjaya made his first-class debut for Sri Lanka Army Sports Club in the 2013–14 Premier League Tournament.',
       'He was the leading wicket-taker in the tournament, taking 32 wickets in seven matches.',
       'He made his List A debut for Sri Lanka Army Sports Club in the 2014–15 Premier Limited Overs Tournament.',
       'In August 2018, he was named in the Sri Lankan squad for the 2018 Asia Cup.',
       'He made his One Day International (ODI) debut for Sri Lanka against Bangladesh on 15 September 2018.',
       "In October 2018, he was named in Sri Lanka's Test squad for their series against England, but he did not play.",
       "In December 2018, he was named in Sri Lanka's team for the 2018 ACC Emerging Teams Asia Cup.",
       'He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss'],
      dtype=object)

Sentence index 11:

He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss

Ground truth: major_inaccurate
Should it be: accurate?

Observation: in a brief analysis, only the NLI scores computed against the context split into individual sentences seem to agree with the ground truth. The code is provided below.

from datasets import load_dataset
dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")

eval_df = dataset['evaluation'].to_pandas()
eval_df.head()

# selecting example 6 (zero-based idx=5)
_idx = 5
sentences = eval_df['gpt3_sentences'][_idx]
context = eval_df['gpt3_text'][_idx]

gt_labels = eval_df['annotation'][_idx]
print(f"labels:\n{gt_labels}")

print(f"Label for sentence index 11: {gt_labels[10]}")


import spacy
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

# NLI scorer setup (constructor signature assumed from the repo's examples)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device)

context = eval_df['gpt3_text'][_idx]
print("Context:\n")
print(context)
print("\n\n")

print(f"**Eval sentence:**\n{sentences[10]}")
print("\n\n")

# Context broken into sentences
nlp = spacy.load("en_core_web_md")
context_samples = [sent.text.strip() for sent in nlp(context).sents]

# Passage:
nli_scores = selfcheck_nli.predict(
    sentences = [sentences[10]],                       
    sampled_passages = [context]
)
print(f"Passage (low score as expected): {nli_scores}")

nli_scores = selfcheck_nli.predict(
    sentences = [sentences[10]],                       
    sampled_passages = context_samples
)
print(f"Avg score on sentences is high: {nli_scores}")

specific_exp = """
In December 2018, he was named in Sri Lanka's team for the 2018 ACC Emerging Teams Asia Cup. He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss
"""
nli_scores_specific = selfcheck_nli.predict(
    sentences = [sentences[10]],                       
    sampled_passages = [specific_exp]
)

print(f"Does it even work - comparing with the matched string (low score as expected)? : {nli_scores_specific}")


print("---- Using LLM - mistralai/Mistral-7B-Instruct-v0.2 ----") # quick eval purpose
from selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# mistralai/Mixtral-8x7B-Instruct-v0.
llm_model = "mistralai/Mistral-7B-Instruct-v0.2"

# Basic Prompt
# "Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer: "

selfcheck_prompt = SelfCheckLLMPrompt(llm_model, device)

context = eval_df['gpt3_text'][5]
print("Context:\n")
print(context)
print("\n\n")

print(f"**Eval sentence:**\n {sentences[10]}")
print("\n\n")

# Context broken into sentences
nlp = spacy.load("en_core_web_md")
context_samples = [sent.text.strip() for sent in nlp(context).sents]

# Passage:
prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],                       
    sampled_passages = [context],
    verbose = True
)
print(f"Passage (low score as expected): {prompt_scores}")


prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],                       
    sampled_passages = context_samples,
    verbose = True
)
print(f"Avg score on sentences (better than NLI): {prompt_scores}")

specific_exp = """
In December 2018, he was named in Sri Lanka's team for the 2018 ACC Emerging Teams Asia Cup. He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss
"""
prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],                       
    sampled_passages = [specific_exp],
    verbose = True
)

print(f"Does it even work - comparing with the matched string (low score as expected)?: {prompt_scores}")

Thoughts?

Questions about the SelfCheckGPT-NLI and SelfCheckGPT-Prompt

Hi @potsawee, thank you for the impressive method and the easy-to-use dataset. Two quick questions about the SelfCheckGPT-NLI and SelfCheckGPT-Prompt methods, which achieve quite impressive performance:

  • Regarding the SelfCheckGPT-NLI method, I noticed that you use the "potsawee/deberta-v3-large-mnli" NLI model. Since I had been using other NLI models in previous work, I replaced it with "microsoft/deberta-large-mnli" and "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli". I also modified the contradiction probabilities following your suggestion of removing the probability of the "neutral" class (see the sketch after this list). However, the performance of these two NLI models seems a little weak. I tested AUC-PR on the "NonFact" setting with the first 20 passages in the dataset (around 400 sentences), and here is the performance:

    • potsawee/deberta-v3-large-mnli: AUC-PR 92.5
    • microsoft/deberta-large-mnli: AUC-PR 89.7
    • MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli: AUC-PR 88.6

    Does this imply that SelfCheckGPT-NLI is relatively sensitive to the choice of NLI model?
  • Regarding the SelfCheckGPT-Prompt method, I see you updated the code a few hours ago. It seems the current code only supports open-source models, not the gpt-3.5-turbo / text-davinci-003 used in the main paper (Table 2). Is there an estimated timeline for releasing the code for SelfCheckGPT-Prompt with gpt-3.5-turbo / text-davinci-003?
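For reference, here is roughly what my "remove neutral and renormalize" adjustment looks like for a three-class MNLI model (a minimal sketch of my setup, not the repo's implementation):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "microsoft/deberta-large-mnli"  # three classes: contradiction / neutral / entailment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

def contradiction_score(premise, hypothesis):
    # probability of contradiction after dropping the neutral class and renormalizing
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits[0], dim=-1)
    label2id = {label.lower(): idx for idx, label in model.config.id2label.items()}
    p_contra = probs[label2id["contradiction"]]
    p_entail = probs[label2id["entailment"]]
    return (p_contra / (p_contra + p_entail)).item()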

Thank you so much for your impressive work and effort!

Is this method actually useful in the real world?

  1. You must sample the LLM multiple times, which takes a long time.
  2. You need an additional model, which uses more GPU memory.
  3. There are no scoring criteria, so how high does the score have to be before we call it a hallucination?

What do you think? I think it may be useful as a test metric or something similar, but it does not seem usable for a real-time API service (e.g., if I want to check for hallucination, I cannot respond to the client in time), since it requires additional hardware resources and response time.

Could you provide an example of using ngram to predict the factuality of a sentence?

Hi @potsawee, thanks for sharing your awesome work.

However, when trying to run your code, I found that although there is an n-gram model, no examples of its usage are provided. The n-gram model is quite different from the others, since its score lies in $[0, \infty)$ while the others lie in $[0, 1]$. I tried min-max normalization, $\frac{x - x_{\min}}{x_{\max} - x_{\min}}$, but this is probably not precise enough, since the $x$ values within a single output can all be similarly high or similarly low, which makes the normalization unreliable.

I wonder whether my understanding is off. Could you please suggest a better normalization method, point out any mistakes, and share an example of evaluating the hallucination score with the n-gram method? Thanks.
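For what it's worth, here is the kind of usage sketch I was hoping for, written by analogy with the other variants in this repo; the constructor argument n=1 and the extra passage argument are assumptions on my part:

import spacy
from selfcheckgpt.modeling_selfcheck import SelfCheckNgram

nlp = spacy.load("en_core_web_sm")
selfcheck_ngram = SelfCheckNgram(n=1)  # assumed: n=1 for a unigram model

passage = "Akila Dananjaya made his ODI debut for Sri Lanka in 2018."
samples = ["Dananjaya debuted for Sri Lanka in ODIs in 2018.",
           "He made his ODI debut against Bangladesh in September 2018."]

sentences = [s.text.strip() for s in nlp(passage).sents]
scores = selfcheck_ngram.predict(
    sentences=sentences,
    passage=passage,            # assumed: the n-gram variant also conditions on the main passage
    sampled_passages=samples,
)
print(scores)  # larger average negative log-probability -> more likely hallucinated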

Question about random baseline

Hi Potsawee,

Thank you for the nice open-source repository and clear code! I have a question about the random baseline you used for the AUC measure: in the notebook, you simply compute the average of the gold labels and take that as the random baseline, but I don't follow the reasoning behind it. What exactly do you mean by a random baseline? Is it the AUC value we would get by guessing a score uniformly at random and comparing it to the gold labels? What does it mean exactly?
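My current guess is that a scorer that guesses uniformly at random has an expected precision equal to the positive-class rate at every recall level, so its AUC-PR equals the base rate, i.e. the mean of the binary gold labels. A quick sanity check I ran (not the notebook's code):

import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
labels = rng.random(100_000) < 0.27        # binary gold labels, ~27% positive
random_scores = rng.random(labels.size)    # uniformly random "predictions"

print(labels.mean())                                    # base rate, ~0.27
print(average_precision_score(labels, random_scores))   # ~0.27 as well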

Thank you and have a nice day!

Feedback: Adding Notebook for R and S generation by LLM

Really great work. One suggestion: the current demo notebook assesses hallucination in a response given a set of samples supplied by the user, but the actual goal of the paper, if I understood correctly, is to assess whether a response from an LLM is hallucinated. I think it would be helpful to have a notebook/module where, given an LLM and a query, the method presented in the paper is applied end to end: first the response R is generated with temperature=0, then N samples are generated with the same LLM at temperature=1 using a sampling technique, and finally the factuality of R is assessed with respect to {S1, S2, ..., SN} (see the sketch below).

My idea is to let the user provide the query and a Hugging Face LLM as input; the system then does all the work behind the scenes and returns the response R together with a hallucination score for a chosen method.
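A rough sketch of what I have in mind, using the transformers generation API and the SelfCheckNLI interface shown elsewhere on this page (the wrapper name and defaults are made up for illustration):

import spacy
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

def check_hallucination(query, model_name, n_samples=5, max_new_tokens=128):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    inputs = tokenizer(query, return_tensors="pt").to(device)

    def decode(ids):
        # drop the prompt tokens, keep only the newly generated text
        return tokenizer.decode(ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # R: deterministic (greedy) decoding, i.e. temperature -> 0
    response = decode(model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens))
    # S1..SN: stochastic samples at temperature 1.0
    samples = [decode(model.generate(**inputs, do_sample=True, temperature=1.0,
                                     max_new_tokens=max_new_tokens))
               for _ in range(n_samples)]

    # score each sentence of R against the samples (SelfCheckNLI constructor assumed from the repo)
    sentences = [s.text.strip() for s in spacy.load("en_core_web_sm")(response).sents]
    scores = SelfCheckNLI(device=device).predict(sentences=sentences, sampled_passages=samples)
    return response, scores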

Contradiction Threshold for NLI Approach

Hi @potsawee, great paper and thanks for the easy-to-use library! One quick question: for the NLI results you show in this repo, what probability threshold are you using to determine factual vs. non-factual?

Thanks!
