
Comments (11)

hanoonaR commented on May 29, 2024

Hi @msra-jqxu,

Thank you for your interest in our work and for reaching out with your questions.

As explained in our paper, our evaluations are performed in a zero-shot manner, using GPT-assisted evaluation to assess the model's capabilities. We compare the model's generated response against the ground truth (GT) using a prompted GPT-3.5 model, which judges the correctness of the response. The evaluation measures the accuracy of the model's predictions and also assigns a relative score on a scale of 1-5.
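To make this concrete, here is a minimal sketch of such a GPT-assisted judge, assuming the pre-v1 `openai` Python package; the prompt wording and the expected output format are illustrative, not the exact ones used in our scripts:

```python
# Sketch of a GPT-assisted judge for video QA (illustrative, not the
# verbatim prompt from the Video-ChatGPT evaluation scripts).
import ast
import openai

openai.api_key = "sk-..."  # your OpenAI API key

def gpt_judge(question, answer, prediction):
    """Ask GPT-3.5 for a yes/no correctness verdict and a 1-5 score."""
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": (
                    "You evaluate video question-answer pairs. Compare the "
                    "predicted answer with the correct answer and respond "
                    "ONLY with a Python dictionary such as "
                    "{'pred': 'yes', 'score': 4}."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Correct Answer: {answer}\n"
                    f"Predicted Answer: {prediction}"
                ),
            },
        ],
    )
    # Parse the dict literal returned by the model.
    return ast.literal_eval(completion["choices"][0]["message"]["content"])
```

The per-sample verdicts are saved so that the final numbers can be re-aggregated without re-querying the API.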

For your convenience, we provide example inference and evaluation scripts for zero-shot QA evaluation, as explained in the quantitative evaluation readme. The evaluation script designed for ActivityNet can also be used to assess performance on the MSVD-QA dataset.

I hope this clarifies your queries. Please do not hesitate to reach out if you have any more questions.


msra-jqxu commented on May 29, 2024

Hi @hanoonaR,
Thank you for your answer! I understand that the 1-5 score is produced by GPT-3.5. I am curious how the accuracy of 64.9 in the MSVD-QA column is obtained, because the corresponding code does not seem to be provided in the repo.

Thanks!
[screenshot: results table from the paper, MSVD-QA column]


hanoonaR commented on May 29, 2024

Hi @msra-jqxu

Thanks for your follow-up question!

The accuracy that you're referring to (64.9 in the MSVD-QA column) is calculated by asking the GPT-3.5 model to make a binary judgment on the correctness of the prediction compared to the actual answer. This is done in the form of a 'yes' or 'no' response. Each 'yes' contributes to the total number of correct responses, which is then used to compute the accuracy.

This process is implemented in the evaluation script (line 190) in our repository. If you examine the code, you'll see that the model is asked to judge the correctness of each prediction, and this binary verdict feeds into the accuracy calculation.
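Conceptually, the aggregation at that point in the script amounts to the following sketch (field names are illustrative):

```python
# Aggregate accuracy and average score over per-sample GPT verdicts.
def aggregate(results):
    """`results` holds dicts like {'pred': 'yes', 'score': 4}."""
    yes = sum(1 for r in results if r["pred"].strip().lower() == "yes")
    accuracy = yes / len(results)   # 0.649 is reported as 64.9
    avg_score = sum(r["score"] for r in results) / len(results)
    return accuracy, avg_score
```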

I hope this clarifies your question about the calculation of accuracy. Please let us know if you have any other questions.


JoseponLee commented on May 29, 2024

Hello, I downloaded the ActivityNet dataset you provided and ran the zero-shot evaluation, but the final results differ significantly from what you reported: my accuracy is 0.216875 and the average score is 3.31125, versus your reported 0.352/2.7. I am wondering where the problem might lie. A small number of test cases failed, but far too few to cause such a large performance difference.
[screenshot: ActivityNet evaluation output]


JoseponLee commented on May 29, 2024

Additionally, during the test, I encountered some errors as shown in the picture below. I'm not sure what caused these errors and whether they have any impact on the results.
[screenshot: error messages from the test run]


msra-jqxu commented on May 29, 2024

Hi @hanoonaR,
Thanks for your kind answer; I can now run the test successfully!
I used the pretrained LLaMA checkpoint provided in LLaVA-7B-Lightening-v1-1 and the Video-ChatGPT model from Video-ChatGPT-7B, but I ran into the same problem as @JoseponLee. I tested on MSVD-QA, and the performance is shown below: the paper reports 64.9/3.3, while I get 77.2/3.86. There are also 47 samples for which ChatGPT would not give an answer due to OpenAI's content management policy, but that should not be enough to affect performance.

[screenshot: MSVD-QA evaluation output]


hanoonaR commented on May 29, 2024

Hi @JoseponLee and @msra-jqxu,

Thank you for bringing these issues to our attention.

I have looked into the ActivityNet evaluations and re-computed the results from the saved GPT-3.5 evaluations, which were generated in June when the code was released. I was able to reproduce the results we initially reported: an accuracy of 35.2 and a score of 2.7. I've uploaded these GPT evaluations for your reference here.
[screenshot: reproduced ActivityNet results, accuracy 35.2 / score 2.7]

Further, I re-ran the GPT-3.5 evaluation on the same pre-calculated predictions used for the evaluation above, and observed a difference in performance: the accuracy shifted from 35.2 to 43.6. You can find the new GPT evaluations here. I have also uploaded the model predictions (which have been consistent across all tests) here.
[screenshot: re-run GPT-3.5 evaluation, accuracy 43.6]

We suspect these differences arise from changes to the GPT models behind the API over time. For more information on this, please refer to this paper.

As an immediate, temporary workaround, we request that you re-evaluate our model using the shared predictions when comparing your performance against our numbers. We appreciate your patience and understanding as we work on a more permanent solution using a static model, which should be available soon.
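For instance, re-scoring the shared predictions could look roughly like this, reusing the gpt_judge and aggregate sketches above and assuming pred.json holds a list of records with question, answer, and pred fields (the actual schema may differ):

```python
import json

# Load the shared predictions (field names assumed; check the actual file).
with open("pred.json") as f:
    predictions = json.load(f)

verdicts = [gpt_judge(p["question"], p["answer"], p["pred"]) for p in predictions]
accuracy, avg_score = aggregate(verdicts)
print(f"Accuracy: {accuracy * 100:.1f}  Avg score: {avg_score:.2f}")
```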

Please let us know if you have any other questions or concerns.


msra-jqxu commented on May 29, 2024

Hi @hanoonaR,
Thanks for replying! I will re-test later using the pred.json file you provided.
I also encountered the same problem as @JoseponLee while testing on ActivityNet; it seems to be due to corrupted videos. Do you know how to fix this, or can it be ignored?

[screenshot: warnings printed while reading ActivityNet videos]


hanoonaR commented on May 29, 2024

Hi @msra-jqxu,

I would suggest ignoring these messages, as they are warnings rather than errors.


msra-jqxu commented on May 29, 2024

Hi @hanoonaR,
I noticed that 'index' is always 0 and is never updated in "video_chatgpt/eval/run_inference_activitynet_qa.py" (code). This seems to result in answer='no' for all samples. Perhaps we should increment index by 1 in each iteration.

[screenshot: the loop in run_inference_activitynet_qa.py where 'index' is never incremented]
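A runnable toy illustration of the suggested fix (the sample data and names here are placeholders, not the actual variables in the script):

```python
# Without the increment, every record is written with id 0.
samples = ["q1", "q2", "q3"]   # stand-ins for the real QA samples

output_list = []
index = 0
for sample in samples:
    output_list.append({"id": index, "question": sample})
    index += 1                 # the missing increment

print(output_list)             # ids 0, 1, 2 instead of 0, 0, 0
```

Equivalently, `for index, sample in enumerate(samples):` avoids the manual counter altogether.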


hanoonaR commented on May 29, 2024

Hi @msra-jqxu

Thank you for catching that. You are correct: 'index' should be incremented in each iteration. I've gone ahead and updated the code.

Regarding the previously shared inference results, they remain accurate, as they were produced with the correct version of the code, in which 'index' increments as intended.

Thanks again for your diligence.


