
rgb's People

Contributors

chen700564


rgb's Issues

Some issues trying Rejection Rate

Hi! Congratulations on the amazing work you've done.

I'm trying to use RGB to test my RAG pipeline, but I'm running into the following issue with Rejection Rate: even when I feed the model only the negative examples, it still sometimes answers correctly, because it turns out some of the negative examples in fact contain the correct answer...

For example:

Question: How much did Elon Musk bought Twitter?
Correct answer (according to RGB): [44 billion]
My model answer: Elon Musk bought Twitter at his original offer price of $54.20 a share, with a total cost of roughly $44 billion.

The model was expected to REJECT answering this, since I only gave it negative examples, right? But in fact, a few of the negative documents contain this answer:

  • Oct 28, 2022 ... Elon Musk takes control of Twitter in $44bn deal · What next for Twitter under Elon Musk? · How the world's richest person bought Twitter · Who is ...

  • After building a stake in Twitter at the start of the year, Mr Musk made his $44bn offer in April, a price tag that looked too high almost as soon as it was agreed. He...

Is this expected behavior, or am I doing something wrong?

Thanks in advance.
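For anyone hitting the same thing, a minimal sketch of a pre-flight check that flags "negative" passages containing the gold answer, so such cases can be excluded from the rejection-rate count. The field names query, answer, and negative follow my reading of the RGB data files, and the one-JSON-object-per-line layout is an assumption; adjust to your copy of the data.

import json

def leaky_negatives(item):
    # Collect gold answer strings; answers may be nested lists of alternatives.
    golds = []
    for ans in item["answer"]:
        golds.extend(ans if isinstance(ans, list) else [ans])
    # Return the negative passages that contain any gold answer verbatim.
    return [p for p in item["negative"]
            if any(str(g).lower() in p.lower() for g in golds)]

# Assumed layout: one JSON object per line, as in the RGB data files.
with open("data/en.json") as f:
    items = [json.loads(line) for line in f]

for item in items:
    leaks = leaky_negatives(item)
    if leaks:
        print(f"{item['query']!r} leaks its answer in {len(leaks)} negative passage(s)")

Note that plain substring matching would still miss paraphrases such as "$44bn" standing in for "44 billion", which is exactly the leak reported above, so a fuzzier match (numbers normalized, currency symbols stripped) may be needed.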

How do you create the data?

Your work is excellent.
I would like to know how you created your data. For example, how was "en_fact.json" created? I noticed that there are positive and negative samples; were they created manually or automatically?

Looking forward to receiving your reply.
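While waiting for the authors: one common semi-automatic recipe, offered here only as a plausible sketch and not as the authors' actual pipeline, is to retrieve web passages per question and label each one positive if it contains the gold answer and negative otherwise, followed by manual spot checks.

def label_passages(gold_answers, passages):
    # Plausible labeling rule (an assumption, not the confirmed RGB procedure):
    # a passage is positive iff it contains some gold answer string.
    positive, negative = [], []
    for p in passages:
        if any(g.lower() in p.lower() for g in gold_answers):
            positive.append(p)
        else:
            negative.append(p)
    return positive, negative

As the Rejection Rate issue above shows, such containment rules are imperfect: a paraphrased answer ("$44bn") can slip into the negative set.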

RAG's future

RAG is certainly promising, but do you think a typical company could use RAG to address customer or product concerns?

Rejection rate of ChatGPT

In the article it says that gpt-3.5-turbo is used to measure the rejection rate. What explains this difference in results for ChatGPT, given that it is used as the reference?
[screenshot of the rejection-rate results being compared]
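One possible source of the gap, offered as an assumption rather than a confirmed explanation: the repo measures rejection in two ways, a string match against the canned refusal sentence from the prompt, and a gpt-3.5-turbo judge for free-form refusals, and the two checks can disagree. Below is a minimal sketch of the string-match side; the exact refusal sentence is quoted from the RGB prompt template as I recall it, so treat the wording as an assumption.

# Canned refusal sentence the RGB prompt asks the model to emit (assumed wording).
REFUSAL = "I can not answer the question because of the insufficient information in documents"

def is_rejection_keyword(prediction: str) -> bool:
    # String-match check: misses refusals phrased any other way,
    # which is why a model-based judge is also run on top.
    return REFUSAL.lower() in prediction.lower()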

Need an OpenAI key for evaluating LLMs?

I would like to thank you for the work you have done and encourage you to continue. I would like to know whether an OpenAI key is required for the evaluation, and whether it is possible to evaluate models that have undergone quantization.
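On the second question: a quantized model can be evaluated by swapping the generation call for a local one. Here is a minimal sketch using Hugging Face transformers with 4-bit bitsandbytes quantization; the model id and generation settings are placeholders, and this is not code from the RGB repo. As for the key, my reading is that an OpenAI key is only needed when ChatGPT is the model under test or acts as the judge.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint

quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant, device_map="auto")

def generate(prompt: str) -> str:
    # Greedy decoding; plug this in wherever the evaluation calls the LLM.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)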

Unable to reproduce Counterfactual Robustness result with ChatGPT

Here's what I did:

Step 1:
python evalue.py --dataset zh --noise_rate 0.0 --modelname chatgpt

Step 2:
python fact_evalue.py --dataset zh --modelname chatgpt

I got the file prediction_zh_chatgpt_temp0.7_noise0.0_passage5_correct0.0_result with the following content:

{
    "all_rate": 0.9473684210526315,
    "noise_rate": 0.0,
    "tt": 270,
    "nums": 285
}

And the file prediction_zh_chatgpt_temp0.7_noise0.0_passage5_correct0.0_chatgptresult.json with the following content:

{
    "reject_rate": 0.0,
    "all_rate": 0.9385245901639344,
    "correct_rate": 0,
    "tt": 229,
    "rejecttt": 0,
    "correct_tt": 0,
    "nums": 244,
    "noise_rate": 0.0
}

I failed to see how this matches the results in the paper:
[screenshot of the paper's results table]

Any ideas?
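A quick sanity check before comparing with the paper: the reported rates should just be tt/nums, and a temperature of 0.7 makes runs nondeterministic, so some spread across runs is expected. A small script to recompute the rates from the two result files (file names copied from the run above):

import json

# Recompute all_rate from the raw counts in each result file.
for path in [
    "prediction_zh_chatgpt_temp0.7_noise0.0_passage5_correct0.0_result",
    "prediction_zh_chatgpt_temp0.7_noise0.0_passage5_correct0.0_chatgptresult.json",
]:
    with open(path) as f:
        r = json.load(f)
    print(path, r["tt"], "/", r["nums"], "=", r["tt"] / r["nums"], "reported:", r["all_rate"])

Here 270/285 ≈ 0.947 and 229/244 ≈ 0.939, so the two files are internally consistent; the open question is only how they line up with the paper's table, and rerunning with a lower temperature (or averaging several runs) may narrow the gap.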

Confused by the calculation of accuracy

Is the calculation method for the accuracy metric in the paper consistent with the code in the repository? I'm a bit confused by this piece of code:

tt = 0
for i in results:
    label = i["label"]
    # When every passage is noise, a rejection (label [-1]) counts as correct.
    if noise_rate == 1 and label[0] == -1:
        tt += 1
    # Otherwise the answer counts only if every gold piece was found
    # (no 0 in label) and at least one piece exists (1 in label).
    elif 0 not in label and 1 in label:
        tt += 1
print(tt / len(results))
scores = {
    "all_rate": tt / len(results),
    "noise_rate": noise_rate,
    "tt": tt,
    "nums": len(results),
}

Benchmarking on the zh.json dataset with a noise_rate of 0.2 implies that, out of 1500 prompts, 300 are missing supplementary knowledge. The accuracy calculated by this code is much lower than the value reported in the paper's table. Am I misunderstanding something, or was this piece of code modified at some point? Thank you!
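To make the metric concrete, here is the same rule applied to three hand-made examples. My reading of the code above: label holds one entry per gold answer piece, 1 = found, 0 = missing, and [-1] marks a rejection, which only counts as correct when noise_rate == 1.

results = [
    {"label": [1, 1]},   # every answer piece found -> counted correct
    {"label": [1, 0]},   # one piece missing -> not counted
    {"label": [-1]},     # rejection -> correct only when noise_rate == 1
]

def accuracy(results, noise_rate):
    tt = 0
    for i in results:
        label = i["label"]
        if noise_rate == 1 and label[0] == -1:
            tt += 1
        elif 0 not in label and 1 in label:
            tt += 1
    return tt / len(results)

print(accuracy(results, noise_rate=0.2))  # 1/3: rejection not rewarded
print(accuracy(results, noise_rate=1))    # 2/3: rejection now counts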
