mm-shap's Issues

RuntimeError: cannot register a hook on a tensor that doesn't require gradient

Hello @LetiP,

It's me again :P Thank you for your patience and time.

The specs of the GPUs I use: 4x NVIDIA GTX 1080 Ti (Pascal, 11 GB memory each), in a server with 24 cores / 48 threads / 256 GB memory.

Here are my settings at the beginning of mm-shap_albef_dataset.py:

num_samples = "all"  # "all" or number
if num_samples != "all":
    num_samples = int(num_samples)
checkp = "mscoco"  # refcoco, mscoco, vqa, flickr30k
write_res = "yes"  # "yes" or "no"
task = "image_sentence_alignment"  # image_sentence_alignment, vqa, gqa
other_tasks_than_valse = ['mscoco', 'vqa', 'gqa', 'gqa_balanced', 'nlvr2']
use_cuda = True

DATA = {
    "existence": ["/home/students/cheng/MM-SHAP/visual7w/images",
                  '/home/students/cheng/MM-SHAP/data/existence.json'],
}

I googled for solutions to this issue, and it usually seems to be related to:

However, neither of those two issues sounds like the case I have here.
Have you encountered a similar problem?

Here is the full traceback:

Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.

  0%|          | 0/534 [00:00<?, ?it/s]
  0%|          | 0/534 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "mm-shap_albef_dataset.py", line 306, in <module>
    shap_values = explainer(X)
  File "/home/students/cheng/MM-SHAP/shap/explainers/_permutation.py", line 62, in __call__
    batch_size=batch_size, outputs=outputs, silent=silent
  File "/home/students/cheng/MM-SHAP/shap/explainers/_permutation.py", line 76, in __call__
    outputs=outputs, silent=silent
  File "/home/students/cheng/MM-SHAP/shap/explainers/_explainer.py", line 260, in __call__
    batch_size=batch_size, outputs=outputs, silent=silent, **kwargs
  File "/home/students/cheng/MM-SHAP/shap/explainers/_permutation.py", line 134, in explain_row
    outputs = fm(masks, zero_index=0, batch_size=batch_size)
  File "/home/students/cheng/MM-SHAP/shap/utils/_masked_model.py", line 65, in __call__
    return self._full_masking_call(full_masks, zero_index=zero_index, batch_size=batch_size)
  File "/home/students/cheng/MM-SHAP/shap/utils/_masked_model.py", line 141, in _full_masking_call
    outputs = self.model(*joined_masked_inputs)
  File "/home/students/cheng/MM-SHAP/shap/models/_model.py", line 21, in __call__
    return np.array(self.inner_model(*args))
  File "mm-shap_albef_dataset.py", line 184, in get_model_prediction
    masked_text_inputs.to("cuda"))
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "mm-shap_albef_dataset.py", line 92, in forward
    return_dict=True,
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 1067, in forward
    mode=mode,
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 601, in forward
    output_attentions,
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 504, in forward
    output_attentions=output_attentions,
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 407, in forward
    output_attentions,
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/students/cheng/MM-SHAP/ALBEF/models/xbert.py", line 329, in forward
    attention_probs.register_hook(self.save_attn_gradients)         
  File "/home/students/cheng/anaconda3/envs/shap/lib/python3.6/site-packages/torch/_tensor.py", line 289, in register_hook
    raise RuntimeError("cannot register a hook on a tensor that "
RuntimeError: cannot register a hook on a tensor that doesn't require gradient
srun: error: gpu08: task 0: Exited with exit code 1

Questions about applying MM-SHAP to a new model (LLaVA-NeXT)

The first question:

masked_X[0, 0] = 49406

I checked this line and am trying to apply the same approach to the model I am interested in, LLaVA-NeXT (https://huggingface.co/docs/transformers/model_doc/llava_next). I understand that the number 49406 corresponds to the vocabulary size minus the CLS and SEP tokens (49408 - 2). Since the corresponding parameter is None by default in LLaVA-NeXT, I am wondering how to pick an appropriate value for it, and likewise for the other parameters. If you have any ideas, please let me know.
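To make the question concrete, here is a small sketch (my own code, not from the repository) of how I would look up the vocabulary size and special-token ids of both checkpoints instead of hardcoding 49406; the checkpoint names are the CLIP and LLaVA-NeXT checkpoints referenced in this issue.

# Sketch (not from the repository): query each tokenizer for its vocabulary size
# and special-token ids, to pick the analogous constant for LLaVA-NeXT.
from transformers import AutoTokenizer

for name in ["openai/clip-vit-base-patch32", "llava-hf/llava-v1.6-mistral-7b-hf"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name,
          "vocab_size:", tok.vocab_size,
          "bos:", tok.bos_token_id,
          "eos:", tok.eos_token_id,
          "pad:", tok.pad_token_id)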

The second question:

I found the following example:

from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

# Reference: https://huggingface.co/docs/transformers/model_doc/clip

Usually, a model would need a text input prompting it for the caption. However, I did not see such a prompt anywhere in 'mm-shap_clip_dataset.py'.

There are some parameters I would need to revise when I implement LLaVA-NeXT:

LLaVA-NeXT: https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf/blob/main/config.json

  • image_size: 336
  • vocab_size: 32000

It seems to me that LLaVA-NeXT is more complex than the CLIP model, since it splits a picture into four sub-images.

CLIP: https://huggingface.co/openai/clip-vit-base-patch32/blob/main/config.json

  • image_size: 224
  • vocab_size: 49408

Here is the setup of the experiment I would like to run on LLaVA-NeXT with the MM-SHAP metric:

  • num_samples: all
  • task: image_sentence_alignment
  • Dataset: existence

RuntimeError when Registering Hooks on Tensor in mm_albef_dataset

First of all, thank you for sharing the code base of this interesting work!

I encountered an issue while trying to run the command python mm-shap_albef_dataset.py 3 "refcoco" "yes".
Below is the error message I received:
RuntimeError: cannot register a hook on a tensor that doesn't require gradient
Here is the full traceback:

  0%|                                                                                             | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "mm-shap_albef_dataset.py", line 307, in <module>
    shap_values = explainer(X)
  File "/mnt/c/Users/Documents/work/MM-SHAP/shap/explainers/_permutation.py", line 62, in __call__
    batch_size=batch_size, outputs=outputs, silent=silent
  File "/mnt/c/Users/Documents/work/MM-SHAP/shap/explainers/_permutation.py", line 76, in __call__
    outputs=outputs, silent=silent
  File "/mnt/c/Users/Documents/work/MM-SHAP/shap/explainers/_explainer.py", line 260, in __call__
    batch_size=batch_size, outputs=outputs, silent=silent, **kwargs
  File "/mnt/c/Users/Documents/work/MM-SHAP/shap/explainers/_permutation.py", line 134, in explain_row
    outputs = fm(masks, zero_index=0, batch_size=batch_size)
  File "/mnt/c/Users/Documents/work/MM-SHAP/shap/utils/_masked_model.py", line 65, in __call__
    return self._full_masking_call(full_masks, zero_index=zero_index, batch_size=batch_size)
  File "/mnt/c/Users/Documents/work/MM-SHAP/shap/utils/_masked_model.py", line 141, in _full_masking_call
    outputs = self.model(*joined_masked_inputs)
  File "/mnt/c/Users/Documents/work/MM-SHAP/shap/models/_model.py", line 21, in __call__
    return np.array(self.inner_model(*args))
  File "mm-shap_albef_dataset.py", line 192, in get_model_prediction
    masked_text_inputs.to("cuda"))
  File "/home/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "mm-shap_albef_dataset.py", line 100, in forward
    return_dict=True,
  File "/home/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/c/Users/Documents/work/phd_work/MM-SHAP/ALBEF/models/xbert.py", line 1067, in forward
    mode=mode,
  File "/home/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/c/Users/Documents/work/phd_work/MM-SHAP/ALBEF/models/xbert.py", line 601, in forward
    output_attentions,
  File "/home/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/c/Users/Documents/work/phd_work/MM-SHAP/ALBEF/models/xbert.py", line 504, in forward
    output_attentions=output_attentions,
  File "/home/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/c/Users/Documents/work/phd_work/MM-SHAP/ALBEF/models/xbert.py", line 407, in forward
    output_attentions,
  File "/home/anaconda3/envs/shap/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/c/Users/Documents/work/phd_work/MM-SHAP/ALBEF/models/xbert.py", line 329, in forward
    attention_probs.register_hook(self.save_attn_gradients)
  File "/home/anaconda3/envs/shap/lib/python3.6/site-packages/torch/_tensor.py", line 289, in register_hook
    raise RuntimeError("cannot register a hook on a tensor that "
RuntimeError: cannot register a hook on a tensor that doesn't require gradient

I resolved this issue by setting save_attention=False at line 215 of mm_albef_dataset.py:

 model.text_encoder.base_model.base_model.encoder.layer[
        block_num].crossattention.self.save_attention = False  

My question is: is it mandatory to keep registering the attention gradients in order to accurately calculate the textual and visual contributions?
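An alternative I considered (just a sketch on my side, not tested) is to guard the hook registration in ALBEF/models/xbert.py so that it only runs when the tensor actually tracks gradients; that way forward-only SHAP calls would go through while gradient-based uses keep the hook:

# Hypothetical sketch, not the repository's code: only register the gradient hook
# when gradients are tracked, so forward-only calls do not raise.
if attention_probs.requires_grad:
    attention_probs.register_hook(self.save_attn_gradients)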

How is Explainer getting image data in CLIP?

As part of my thesis, I am trying to understand the code in mm-shap_clip_dataset.py, and I'm a bit stumped at the following section, in which we generate the tensor X which is passed to the Explainer instance to generate masks and then SHAP values. I am concerned that in the code as it is written here, X ends up containing no image data -- or at least, I do not understand how it does.

# shap values need one sentence for transformer
            for k, sentence in enumerate(test_sentences):

                try:  # image feature extraction can go wrong
                    inputs = processor(
                        text=sentence, images=image, return_tensors="pt", padding=True
                    )
                except:
                    continue
                model_prediction = model(**inputs).logits_per_image[0,0].item()

                text_length_tok = inputs.input_ids.shape[1]
                p = int(math.ceil(np.sqrt(text_length_tok)))
                patch_size = 224 // p
                image_token_ids = torch.tensor(
                    range(1, p**2+1)).unsqueeze(0) # (inputs.pixel_values.shape[-1] // patch_size)**2 +1
                # make a cobination between tokens and pixel_values (transform to patches first)
                X = torch.cat(
                    (inputs.input_ids, image_token_ids), 1).unsqueeze(1)

                # create an explainer with model and image masker
                explainer = shap.Explainer(
                    get_model_prediction, custom_masker, silent=True)
                shap_values = explainer(X)
                mm_score = compute_mm_score(text_length_tok, shap_values)

Specifically, X consists of a concatenation of two things: image_token_ids (image) and inputs.input_ids (text)

                # make a cobination between tokens and pixel_values (transform to patches first)
                X = torch.cat(
                    (inputs.input_ids, image_token_ids), 1).unsqueeze(1)

But while the inputs object contains both text and image data, image_token_ids seems to take no image data from the inputs object's pixel_values (other than in its shape).

image_token_ids = torch.tensor(
                    range(1, p**2+1)).unsqueeze(0) # (inputs.pixel_values.shape[-1] // patch_size)**2 +1

Then, by the time we generate the concatenation X, we are combining inputs.input_ids and image_token_ids without having added anything to image_token_ids.

Right after X is assigned, we create an Explainer and pass X to it.



                # create an explainer with model and image masker
                explainer = shap.Explainer(
                    get_model_prediction, custom_masker, silent=True)
                shap_values = explainer(X)

So what I am trying to understand is: how does the explainer get any access to the image data when X consists only of the text data plus the blank image_token_ids? I would appreciate any input, thanks!
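My current guess, written out as a hypothetical sketch (it reuses inputs, model, p, patch_size and text_length_tok from the snippet above, and assumes the masker zeroes out masked patch ids), is that the pixel data never travels through X at all: the prediction function captures inputs from the enclosing scope, and the image token ids in X only tell it which patches to blank out.

import numpy as np
import torch

# Hypothetical sketch, not the repository's exact code: X carries only token ids
# and patch indices; the pixel data is captured from the surrounding scope.
def get_model_prediction(x):
    preds = []
    for row in x:
        row = torch.as_tensor(row).flatten()
        text_ids = row[:text_length_tok].unsqueeze(0)
        keep_patch = row[text_length_tok:] != 0      # assumes masked ids were zeroed
        pixel_values = inputs.pixel_values.clone()   # image data enters here, not via X
        for i, keep in enumerate(keep_patch):
            if not keep:                             # blank out the i-th patch
                r, c = divmod(i, p)
                pixel_values[..., r * patch_size:(r + 1) * patch_size,
                                  c * patch_size:(c + 1) * patch_size] = 0
        out = model(input_ids=text_ids,
                    attention_mask=torch.ones_like(text_ids),
                    pixel_values=pixel_values)
        preds.append(out.logits_per_image[0, 0].item())
    return np.array(preds)

Is that roughly the intended design, or am I missing where the pixel values actually enter X?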

Models on the VQA task

Hello,

Thank you for your work!

I am trying to understand how the reported Shapley values were estimated for the VQA/GQA tasks. Here are some specific questions:

  1. Are the question and answer of each instance concatenated together for textual input to the model (LXMERT/ALBEF-VQA)?
  2. What model output is being distributed among the tokens? final argmax probability?

Parameters that affect GPU RAM usage

Thanks for the work. I tried to reproduce the results of this paper; however, I ran into insufficient GPU RAM.

The Python file I am running: mm-shap_albef_dataset.py
DATA I used: existence
Number of samples: all

The basic settings I used:

# num_samples = sys.argv[1] # "all" or number
num_samples = "all" # "all" or number
if num_samples != "all":
    num_samples = int(num_samples)
# checkp = sys.argv[2] #  refcoco, mscoco, vqa, flickr30k
checkp = "mscoco" #  refcoco, mscoco, vqa, flickr30k
# write_res = sys.argv[3] # "yes" or "no"
write_res = "yes" # "yes" or "no"
task = "image_sentence_alignment"  # image_sentence_alignment, vqa, gqa
other_tasks_than_valse = ['mscoco', 'vqa', 'gqa', 'gqa_balanced', 'nlvr2']
use_cuda = True

The problem I encountered:

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 2.00 GiB total capacity; 1008.19 MiB already allocated; 11.44 MiB free; 1.04 GiB reserved in total by PyTorch)

Since I cannot scale up my GPU RAM right now, I am wondering which parameters I should pay attention to.

A few options on my mind right now:

  • num_samples

  • patch_size (?)
    I tried this today, changing it from 16 to 32, but it changes the expected tensor shapes and produces the following error:

RuntimeError: Error(s) in loading state_dict for VL_Transformer_ITM:
        size mismatch for visual_encoder.pos_embed: copying a param with shape torch.Size([1, 577, 768]) from checkpoint, the shape in current model is torch.Size([1, 145, 768]).
        size mismatch for visual_encoder.patch_embed.proj.weight: copying a param with shape torch.Size([768, 3, 16, 16]) from checkpoint, the shape in current model is torch.Size([768, 3, 32, 32]).
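Another knob I noticed in the traceback is the batch_size used by the explainer call. A sketch of what I plan to try next (assuming the vendored shap copy keeps the upstream batch_size argument of Explainer.__call__):

import torch

# Sketch (untested): evaluate fewer masked inputs per forward pass, and release
# cached GPU memory after each sample.
shap_values = explainer(X, batch_size=8)  # default is "auto"
torch.cuda.empty_cache()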

Thanks in advance for your time and for reading.

Unable to locate "existence_benchmark.test_mturk.json"

Description:

I'm currently working with MM-SHAP and I need the file existence_benchmark.test_mturk.json. Can someone provide guidance on how to obtain this file, or is there a specific procedure to generate it?

If I use the existence.json from VALSE instead, it raises an error at test_sentences = [foil["caption"], foil["foils"][0]]:
KeyError: 'foils'
