
Comments (11)

hanoonaR commented on May 29, 2024

Hi @msra-jqxu,

Thank you for your interest in our work and for reaching out with your questions.

As explained in our paper, our evaluations are performed in a zero-shot manner, using GPT-assisted evaluation to assess the model's capabilities. We compare the model's generated response against the ground truth (GT) using a prompted GPT-3.5 model, which judges the correctness of the response. The evaluation measures the accuracy of the model's predictions and also assigns a relative score on a scale of 1-5.
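To make this concrete, here is a minimal sketch of such a GPT-assisted judge, assuming the pre-v1 `openai` Python package; the prompt wording and the expected output format are illustrative, not the exact ones used in our scripts:

```python
# Sketch of a GPT-assisted judge for video QA (illustrative, not the
# verbatim prompt from the Video-ChatGPT evaluation scripts).
import ast
import openai

openai.api_key = "sk-..."  # your OpenAI API key

def gpt_judge(question, answer, prediction):
    """Ask GPT-3.5 for a yes/no correctness verdict and a 1-5 score."""
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": (
                    "You evaluate video question-answer pairs. Compare the "
                    "predicted answer with the correct answer and respond "
                    "ONLY with a Python dictionary such as "
                    "{'pred': 'yes', 'score': 4}."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Correct Answer: {answer}\n"
                    f"Predicted Answer: {prediction}"
                ),
            },
        ],
    )
    # Parse the dict literal returned by the model.
    return ast.literal_eval(completion["choices"][0]["message"]["content"])
```

The per-sample verdicts are saved so that the final numbers can be re-aggregated without re-querying the API.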

For your convenience, we provide example inference and evaluation scripts for zero-shot QA evaluation, as explained in the quantitative evaluation readme. The evaluation script designed for ActivityNet can also be used to assess performance on the MSVD-QA dataset.

I hope this clarifies your queries. Please do not hesitate to reach out if you have any more questions.


msra-jqxu commented on May 29, 2024

Hi @hanoonaR,
Thank you for your answer! I understand that the 1-5 score is produced by GPT-3.5. I am curious how the accuracy of 64.9 in the MSVD-QA column is obtained, because the corresponding code does not seem to be provided in the repo.

Thanks!
[screenshot: results table from the paper, MSVD-QA column]


hanoonaR commented on May 29, 2024

Hi @msra-jqxu

Thanks for your follow-up question!

The accuracy that you're referring to (64.9 in the MSVD-QA column) is calculated by asking the GPT-3.5 model to make a binary judgment on the correctness of the prediction compared to the actual answer. This is done in the form of a 'yes' or 'no' response. Each 'yes' contributes to the total number of correct responses, which is then used to compute the accuracy.

This process is implemented in the evaluation script (line 190) in our repository. If you examine the code, you'll see that the model is asked to judge the correctness of each prediction, and this binary verdict feeds into the accuracy calculation.
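Conceptually, the aggregation at that point in the script amounts to the following sketch (field names are illustrative):

```python
# Aggregate accuracy and average score over per-sample GPT verdicts.
def aggregate(results):
    """`results` holds dicts like {'pred': 'yes', 'score': 4}."""
    yes = sum(1 for r in results if r["pred"].strip().lower() == "yes")
    accuracy = yes / len(results)   # 0.649 is reported as 64.9
    avg_score = sum(r["score"] for r in results) / len(results)
    return accuracy, avg_score
```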

I hope this clarifies your question about the calculation of accuracy. Please let us know if you have any other questions.


JoseponLee commented on May 29, 2024

Hello, I downloaded the ActivityNet dataset you provided and ran the zero-shot evaluation, but the final results differ significantly from what you reported: my accuracy is 0.216875 and the average score is 3.31125, versus your reported 0.352/2.7. I am wondering where the problem might lie. A small number of test cases failed, but far too few to cause such a large performance difference.
[screenshot: ActivityNet evaluation output]


JoseponLee commented on May 29, 2024

Additionally, during the test, I encountered some errors as shown in the picture below. I'm not sure what caused these errors and whether they have any impact on the results.
[screenshot: error messages from the test run]


msra-jqxu commented on May 29, 2024

Hi @hanoonaR,
Thanks for your kind answer; I can now run the test successfully!
I used the pretrained LLaMA checkpoint provided in LLaVA-7B-Lightening-v1-1 and the Video-ChatGPT model from Video-ChatGPT-7B, but I ran into the same problem as @JoseponLee. I tested on MSVD-QA, and the performance is shown below: the paper reports 64.9/3.3, while I get 77.2/3.86. There are also 47 samples for which ChatGPT would not give an answer due to OpenAI's content management policy, but that should not be enough to affect performance.

[screenshot: MSVD-QA evaluation output]


hanoonaR commented on May 29, 2024

Hi @JoseponLee and @msra-jqxu,

Thank you for bringing these issues to our attention.

I have looked into the ActivityNet evaluations and re-computed the results from the saved GPT-3.5 evaluations, which were generated in June when the code was released. I was able to reproduce the results we initially reported: an accuracy of 35.2 and a score of 2.7. I've uploaded these GPT evaluations for your reference here.
[screenshot: reproduced ActivityNet results, accuracy 35.2 / score 2.7]

Further, I re-ran the GPT-3.5 evaluation on the same pre-calculated predictions used for the evaluation above, and observed a difference in performance: the accuracy shifted from 35.2 to 43.6. You can find the new GPT evaluations here. I have also uploaded the model predictions (which have been consistent across all tests) here.
[screenshot: re-run GPT-3.5 evaluation, accuracy 43.6]

We suspect these differences arise from changes to the GPT models behind the API over time. For more information on this, please refer to this paper.

As an immediate, temporary workaround, we request that you re-evaluate our model using the shared predictions when comparing your performance against our numbers. We appreciate your patience and understanding as we work on a more permanent solution using a static model, which should be available soon.
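For instance, re-scoring the shared predictions could look roughly like this, reusing the gpt_judge and aggregate sketches above and assuming pred.json holds a list of records with question, answer, and pred fields (the actual schema may differ):

```python
import json

# Load the shared predictions (field names assumed; check the actual file).
with open("pred.json") as f:
    predictions = json.load(f)

verdicts = [gpt_judge(p["question"], p["answer"], p["pred"]) for p in predictions]
accuracy, avg_score = aggregate(verdicts)
print(f"Accuracy: {accuracy * 100:.1f}  Avg score: {avg_score:.2f}")
```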

Please let us know if you have any other questions or concerns.


msra-jqxu commented on May 29, 2024

Hi @hanoonaR,
Thanks for replying! I will re-test later using the pred.json file you provided.
I also encountered the same problem as @JoseponLee while testing on ActivityNet; it seems to be due to corrupted videos. Do you know how to fix this, or can it be ignored?

[screenshot: warnings printed while reading ActivityNet videos]


hanoonaR commented on May 29, 2024

Hi @msra-jqxu,

I would suggest ignoring these messages, as they are warnings rather than errors.


msra-jqxu commented on May 29, 2024

Hi @hanoonaR,
I noticed that 'index' is always 0 and is never updated in "video_chatgpt/eval/run_inference_activitynet_qa.py" (code). This seems to result in answer='no' for all samples. Perhaps we should increment index by 1 in each iteration.

[screenshot: the loop in run_inference_activitynet_qa.py where 'index' is never incremented]
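A runnable toy illustration of the suggested fix (the sample data and names here are placeholders, not the actual variables in the script):

```python
# Without the increment, every record is written with id 0.
samples = ["q1", "q2", "q3"]   # stand-ins for the real QA samples

output_list = []
index = 0
for sample in samples:
    output_list.append({"id": index, "question": sample})
    index += 1                 # the missing increment

print(output_list)             # ids 0, 1, 2 instead of 0, 0, 0
```

Equivalently, `for index, sample in enumerate(samples):` avoids the manual counter altogether.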


hanoonaR commented on May 29, 2024

Hi @msra-jqxu

Thank you for catching that. You are correct: 'index' should be incremented in each iteration. I've gone ahead and updated the code.

Regarding the previously shared inference results, they remain accurate, as they were produced with the correct version of the code, in which 'index' increments as intended.

Thanks again for your diligence.


