mbzuai-oryx / groundinglmm

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

Home Page: https://grounding-anything.com

Python 99.60% Shell 0.40%

foundation-models lmm vision-and-language vision-language-model llm-agent

groundinglmm's Issues

Release of pre-training instructions?

Hi!

I have recently taken a great interest in your work! However I was wondering: are you planning on releasing the pre-training code/instructions as well? I would love to experiment with training the model from scratch!

Thanks,
Rachel

Looking for 'test_caption.json' of Visual Genome

Hi! Thanks for your great work and open-source efforts! Could you please provide the download link for test_caption.json of Visual Genome?

3D implementation of GLaMM

Hi!

I have been experimenting with your model for quite some time now, specifically on medical imaging data.

I am currently looking into extending your architecture so that it can encode sequences of images and decode them into 3D segmentations.

I was curious if you maybe have a take on how to tackle this. It would greatly help me, as I am doing my master's thesis on LMMs in medical imaging with your model as the main focus of interest! :)

Thank you in advance,
Rachel

Date for release of code, model and data

Nice work!
When will you release code, model and data?

Question about memory to be used

First thank you for your great work!

I wonder how much VRAM will be used in inference?

Thank you in advance.

Question about the seg-token mask computation

Hi Authors,
Thanks for the code, and datasets!
I had a question about this line:
mask = input_ids[:, 1:] == self.seg_token_idx
Why do we index from the first token? Shouldn't the output hidden states have a 1-1 mapping with the input_ids?

Can you provide a download link for the pth file of the SAM model?

As shown in line 32 of "https://github.com/mbzuai-oryx/groundingLMM/tree/main/scripts/finetune_glamm_refsegm.sh", training requires a SAM pre-trained pth file, but it is not available in Model zoo. Please provide it. @mmaaz60

For further information visit https://errors.pydantic.dev/2.5/v/missing

Dear professor, could you please answer my question? I completed the installation following the environment you provided, but the errors below still appear when I run the code. The chat model you provided does not produce a corresponding text reply, although the bounding box in this picture can be drawn.

I visited the website mentioned in the error to look into it, but could not find relevant information. Please help.

About region caption

The generated results only describe the image content and do not answer the specified prompt.

result:

about create_seg_token_mask

Hi! Thanks for your great work. I have a question about the following code:
def _create_seg_token_mask(self, input_ids):
    mask = input_ids[:, 1:] == self.seg_token_idx
    return torch.cat(
        [torch.zeros((mask.shape[0], 575)).bool().cuda(), mask, torch.zeros((mask.shape[0], 1)).bool().cuda()],
        dim=1
    )

Can you explain the meaning of the number 575? And why are these zero vectors concatenated to the left and right of the mask? Thanks in advance for your answer!
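One possible reading of the constants, offered as an assumption rather than the authors' confirmed explanation: if the single image placeholder in input_ids is expanded into 576 visual embeddings before the language model runs, the hidden-state sequence is 575 positions longer than input_ids. Dropping the first token, prepending 575 zeros, and appending one zero then yields a mask of exactly that length. The sketch below checks only this arithmetic with plain Python lists; NUM_IMAGE_TOKENS, IMAGE_PLACEHOLDER, SEG_ID, and the toy token IDs are hypothetical values chosen for illustration.

```python
# Assumption: one image placeholder token becomes 576 visual embeddings.
NUM_IMAGE_TOKENS = 576
IMAGE_PLACEHOLDER = -200   # hypothetical placeholder ID
SEG_ID = 32004             # hypothetical [SEG] token ID


def create_seg_token_mask(input_ids):
    """List-based analogue of the quoted method for a single sequence."""
    # Compare every token except the first against the [SEG] ID.
    shifted = [tok == SEG_ID for tok in input_ids[1:]]
    # Pad on the left for the extra image positions and on the right by one.
    return [False] * (NUM_IMAGE_TOKENS - 1) + shifted + [False]


# Toy sequence: BOS, image placeholder, two text tokens, [SEG], EOS.
input_ids = [1, IMAGE_PLACEHOLDER, 450, 263, SEG_ID, 2]
mask = create_seg_token_mask(input_ids)

# Length of the sequence after the placeholder is expanded.
expanded_len = len(input_ids) - 1 + NUM_IMAGE_TOKENS

# Index of [SEG] in the expanded sequence: BOS, then 576 image slots,
# then the two text tokens, then [SEG].
seg_pos_expanded = 1 + NUM_IMAGE_TOKENS + 2

print(len(mask) == expanded_len)
print(mask.index(True), seg_pos_expanded)
```

In this toy layout the single True entry lands one position before [SEG] in the expanded sequence, that is, at the position whose next-token prediction is [SEG]. Whether that is the intended alignment in GLaMM is something only the authors can confirm.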

Undefined `self.base_dir` in `GranDfDataset.__init__`

Hello, it seems that in the following code, self.base_dir is not defined before calling super().__init__, and would raise AttributeError.

groundingLMM/dataset/gcg_datasets/GranDf_gcg_ds.py

Lines 160 to 173 in 1c04c3b

class GranDfDataset(GCGBaseDataset):
    """
    Human annotated dataset proposed in GLaMM as part of GranDf dataset.
    """
    def __init__(self, dataset_dir, tokenizer, global_image_encoder, epoch_samples=8000, precision="fp32",
                 image_size=224, num_classes_per_sample=3, validation=False, random_sampling=True):
        json_path = "GranDf_HA_GCG_train.json"
        image_dir = os.path.join(self.base_dir, "GranDf_HA_images", "train")
        mode = "Val" if validation else "Train"
        super().__init__(
            dataset_dir, tokenizer, global_image_encoder, epoch_samples, precision, image_size, num_classes_per_sample,
            validation, random_sampling, image_dir, json_path, )
        print('\033[92m' + "----GCG-{}: GranDf-GCG dataset initialized----".format(mode) + '\033[0m')
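The failure mode described in this report can be reproduced with a toy class hierarchy. This is a minimal sketch and not the repository's code: Base, Broken, and Fixed are hypothetical stand-ins, and it is an assumption that the parent class is where base_dir gets assigned. One way to avoid the error, shown in Fixed, is to build the path from the constructor argument instead of from an attribute that does not exist yet.

```python
import os


class Base:
    def __init__(self, dataset_dir, image_dir):
        # Hypothetical: the parent is the place where base_dir is first set.
        self.base_dir = dataset_dir
        self.image_dir = image_dir


class Broken(Base):
    def __init__(self, dataset_dir):
        # self.base_dir is read before Base.__init__ has assigned it.
        image_dir = os.path.join(self.base_dir, "GranDf_HA_images", "train")
        super().__init__(dataset_dir, image_dir)


class Fixed(Base):
    def __init__(self, dataset_dir):
        # Use the argument directly, so nothing depends on parent state.
        image_dir = os.path.join(dataset_dir, "GranDf_HA_images", "train")
        super().__init__(dataset_dir, image_dir)


def raises_attribute_error(cls):
    """Return True if constructing cls raises AttributeError."""
    try:
        cls("data")
    except AttributeError:
        return True
    return False


print(raises_attribute_error(Broken))
print(raises_attribute_error(Fixed))
```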

Phrase grounding model

Hi,

Are you planning to release the checkpoint of the phrase grounding model? Thank you!

Regards

Data Annotation Pipeline

I was wondering whether it would be possible to make the execution script for the automated annotation pipeline publicly available. I have reviewed the dataset definition in groundingLMM/dataset, but I am uncertain about the process for generating annotations. Any guidance or access to the execution script would be greatly appreciated.

Data release

Hi! Loved reading the paper. Is there a release date on the data that you've used to train?

Why are train and val in `GCGBaseDataset` reversed

Thanks for your great work, but is there a mistake here?

groundingLMM/dataset/gcg_datasets/GranDf_gcg_ds.py

Line 179 in a9892e6

 json_files = {'validation': "OpenPsgGCG_train.json", 'training': "OpenPsgGCG_val.json"} 

May I ask for the model's total parameter count?

Demo issue

Hello dear author, your work is great! I tried to reproduce it as soon as the code was released today.

I have been following your work, and when I ran your code today I hit a problem in the demo section. The error above is what occurred.

I have also run into many other issues in the code.

Could you please help with this question?

GranD Detailed Operation Guide

Your work is of great academic value and significance, and I am very grateful for the contributions you have made. I would like to ask you about the specific operational steps for implementing the GranD Automated Annotation Pipeline. I am very grateful that you could take the time out of your busy schedule to look at my question.

Can you share the `GranD` dataset?

the demo caption is very simple

The demo caption is very simple, unlike the detailed one in the paper. Did you limit the maximum output length?

the caption result is quite simple

Fluctuating results on the RefCOCO family when evaluating referring expression segmentation

Thank you for sharing your great work!
I am trying to validate the GLaMM-RefSeg model and have noticed that the performance can vary significantly across inference runs (approximately +/- 1 to 2). Do you have any insight into this phenomenon, or is there a config that I should adjust to get more consistent predictions? Thank you!

local llm interface for glamm

Description: GLaMM performs very well on semantic segmentation. I want to bring GLaMM into my multi-agent workflow to solve a sub-task. My multi-agent system is built with the autogen framework.

Request: In autogen, we usually expose a local URL through litellm so that autogen can call other LLMs (such as models on ollama), for example:
litellm --model ollama/llama2
Is there a similar way to do this for GLaMM? Thanks!

Inference speed

Hi,

I can now run the code successfully on AMD GPUs, but I've noticed that inference is very slow. Could this be because I have not installed flash attention (it is complex to compile for AMD), or am I missing something else?

The training losses in the GCG task

Hello, could you please provide a detailed explanation of the training losses in the GCG task? It seems that the segmentation task and the text generation task are separate. Are there any specific losses that make the phrases in the image-level captions match the corresponding segmentation masks?

Code release

Hi! Very nice and promising work! When will the code be released? I am really looking forward to experimenting with your code.

Some bugs in the GranD_ReferringSegm_ds.py

Hello, I think there may be several bugs in the GranD_ReferringSegm_ds.py file, such as:

- an undefined argument max_gt_per_img in the GrandReferSegmDataset
- an incorrect implementation of the create_conversations method
- some typos, like data_masks = data_item['maks'], which should be data_masks = data_item['masks']

I would greatly appreciate it if you could address and rectify these concerns, followed by thorough testing of the code at your earliest convenience. Thank you!

Supplementary materials

Hi! Very nice and promising work!

Where can I download the supplementary materials? Thank you!

Easiest way to fine-tune on custom data?

Hello! Thank you for this great work! Is there a preferred way to fine-tune this model on custom data? I am specifically interested in fine-tuning for open-vocabulary segmentation and referring segmentation.

Thank you!

A bug in region captioning evaluation scripts

Hi, thanks for your great work! I just noticed there might be a bug in eval/region_captioning/evaluate.py.

Specifically, when loading generated results from a collection of result files, it uses

for result_file in os.listdir(args.results_dir):
    all_results = json.load(open(f"{args.results_dir}/{result_file}", "r"))
merged_file_path = f"{args.results_dir}/merged.json"

In the end, only the results from the last result file are loaded into all_results, so the model is effectively evaluated on a subset of the test set whenever multiple GPUs are used for inference.
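A possible fix is to accumulate the contents of every result file instead of reassigning all_results on each iteration. The sketch below is written under two assumptions that the report does not state: each result file holds a JSON list, and a previously written merged.json in the same directory should be skipped. The function name load_all_results and the rank_*.json file names are hypothetical.

```python
import json
import os
import tempfile


def load_all_results(results_dir, skip=("merged.json",)):
    """Concatenate the JSON lists stored in every result file of a directory."""
    all_results = []
    for name in sorted(os.listdir(results_dir)):
        # Ignore non-JSON files and any earlier merged output.
        if not name.endswith(".json") or name in skip:
            continue
        with open(os.path.join(results_dir, name), "r") as f:
            # extend() keeps results from earlier files instead of overwriting.
            all_results.extend(json.load(f))
    return all_results


# Small self-contained demo with two per-GPU result files.
with tempfile.TemporaryDirectory() as tmp:
    per_rank = [[{"id": 1}], [{"id": 2}, {"id": 3}]]
    for rank, rows in enumerate(per_rank):
        with open(os.path.join(tmp, f"rank_{rank}.json"), "w") as f:
            json.dump(rows, f)
    merged = load_all_results(tmp)

print(len(merged))
```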

GLaMM-FullScope model generates only a single mask

Hi @hanoonaR
Congrats on the CVPR acceptance. Great work, thank you for sharing the code and the model weights.

I have a couple of questions.

--------------------------------------------- Q1 --------------------------------------------------------
I was trying to reproduce the results using the balloon.jpg image available in the repo with the prompt "Describe the image. Please output interleaved segmentation mask." However, the network does not seem to generate multiple masks, in spite of the generated text being "The image shows a hot air balloon [SEG] flying over a river [SEG] . The sky [SEG] is visible over the river."

I went a step further to check whether the issue is on my side. Below are the generated "generated_output_ids":

[  319, 13563,  1546,   263, 12758,  5199,   322,   385, 23116, 21082,
         20255, 29889,   450, 20255,  4076,  8444, 29892, 13173, 29892,   322,
          1248,   568,  6089,   304,   278,  5199, 29915, 29879,  5155, 29889,
          3148,  1001, 29901,   450, 32000,  -200, 29871, 32001, 16123,  2247,
           385,   975,  1493,   310,   278,  7623, 29889,    13,  4002, 29581,
           278,  1967, 29889,  3529,  1962,  1006,   280, 10511, 10768,   362,
         11105, 29889,   319,  1799,  9047, 13566, 29901,   450,  1967,  3697,
           263, 32005,  7375,  4799,  6411,   417,   265, 32006, 32004, 22764,
           975,   263, 32005,  8580, 32006, 32004,   869,   450, 32005, 14744,
         32006, 32004,   338,  7962,   975,   278,  8580, 29889,     2]

As you can see, id 29871 (seg_token_idx) is generated only once. I am not sure if I am missing something in my attempts to reproduce the results, and I would appreciate your educated guess about what I might be doing wrong.

--------------------------------------------- Q2 --------------------------------------------------------
Another interesting property I observed: when I run tokenizer("[SEG]").input_ids, the output indices are [ 1, 29871, 32004], whereas tokenizer("a [SEG]").input_ids returns [ 1, 263, 32004]. As you can see, the tokenizer outputs id 29871 (seg_token_idx) only in the first case. Is this expected? I am curious to understand the intuition behind this.

Thank you, I appreciate any time you can spend to help with my questions.

Regards,
Pradyumna.
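A quick count over the IDs quoted in Q1 may help narrow this down. Both tokenizer calls in Q2 end in 32004, which suggests (as an inference from those outputs, not a confirmed fact) that 32004 is the ID of [SEG], while 29871 is a separate piece that appears only when nothing precedes [SEG]; in LLaMA-family SentencePiece vocabularies that ID is commonly the bare whitespace piece. The labels CANDIDATE_SEG_ID and CANDIDATE_SPACE_ID below are hypothetical names for these two guesses.

```python
from collections import Counter

# IDs copied from the generated_output_ids listing in Q1.
generated_output_ids = [
    319, 13563, 1546, 263, 12758, 5199, 322, 385, 23116, 21082,
    20255, 29889, 450, 20255, 4076, 8444, 29892, 13173, 29892, 322,
    1248, 568, 6089, 304, 278, 5199, 29915, 29879, 5155, 29889,
    3148, 1001, 29901, 450, 32000, -200, 29871, 32001, 16123, 2247,
    385, 975, 1493, 310, 278, 7623, 29889, 13, 4002, 29581,
    278, 1967, 29889, 3529, 1962, 1006, 280, 10511, 10768, 362,
    11105, 29889, 319, 1799, 9047, 13566, 29901, 450, 1967, 3697,
    263, 32005, 7375, 4799, 6411, 417, 265, 32006, 32004, 22764,
    975, 263, 32005, 8580, 32006, 32004, 869, 450, 32005, 14744,
    32006, 32004, 338, 7962, 975, 278, 8580, 29889, 2,
]

CANDIDATE_SEG_ID = 32004    # guess: last ID in both tokenizer outputs from Q2
CANDIDATE_SPACE_ID = 29871  # guess: whitespace piece, not [SEG]

counts = Counter(generated_output_ids)
print(counts[CANDIDATE_SEG_ID], counts[CANDIDATE_SPACE_ID])
```

The candidate [SEG] ID occurs three times, which matches the three [SEG] markers in the generated text, whereas 29871 occurs once. If seg_token_idx is being read as 29871, that alone could explain a single mask, but this is a guess that would need checking against how the token index is looked up in the actual setup.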

Grand-env

Hello respected friend, your environment file seems a bit odd, and I can't even use pip to install some of its contents. ca-certificates=2023.05.30=h06a4308_ does not seem to be the correct format for installation packages.

An error is reported when running eval

Hello, thank you very much for your contribution. I encountered an error while evaluating the GCG task, and the calculated evaluation results also look wrong.

Question about Output Quality Difference Between Local and Online Demo for MBZUAI/GLaMM-FullScope

Hello,

I've successfully run the demo locally and managed to obtain output results. However, I've noticed that the quality of the output significantly differs from what is showcased in the online demo, with the local results being notably inferior. I'm currently using the MBZUAI/GLaMM-FullScope for my tests. Could you please shed some light on why there might be such a discrepancy between the two?

Thank you for your assistance.

how to convert finetune weight to huggingface format?

I want to convert my fine-tuned weights to a Hugging Face model and run inference with it, but the inference code is missing.

when do you plan to release the dataset?

Running GranD Automated Annotation pipeline from scratch

@hanoonaR and @mmaaz60 I wish to run the automated annotation pipeline from scratch. As mentioned in #35, I tried the command:
conda create --name grand_env_1 --file requirements_grand_env_1.txt
and got the error:

Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - async-timeout==4.0.2=pypi_0
  - terminaltables==3.1.10=pypi_0
  - ipython==8.14.0=pypi_0
  - pytz==2023.3=pypi_0
  - groundingdino==0.1.0=dev_0
  - openai-whisper==20230314=pypi_0
  - async-lru==2.0.3=pypi_0
  - jupyter-events==0.6.3=pypi_0
  - chardet==5.2.0=pypi_0
  - codecov==2.1.13=pypi_0
  - aiosignal==1.3.1=pypi_0
  - numpy==1.24.3=pypi_0
  - peft==0.3.0=pypi_0
  - fastapi==0.100.0=pypi_0
  - aliyun-python-sdk-kms==2.16.1=pypi_0
  - awq==0.1.0=pypi_0
  - mmcv-full==1.5.0=dev_0
  - multiscaledeformableattention==1.0=pypi_0
  - pycocotools==2.0.6=pypi_0
  - multiprocess==0.70.15=pypi_0
  - importlib-resources==6.0.0=pypi_0
  - pybind11==2.11.1=pypi_0
  - scipy==1.11.1=pypi_0
  - typepy==1.3.1=pypi_0
  - isort==4.3.21=pypi_0
  - mmdet==2.25.3=dev_0
  - onnxruntime==1.15.1=pypi_0
  - exceptiongroup==1.1.2=pypi_0
  - torchvision==0.15.2+cu117=pypi_0
  - supervision==0.11.1=pypi_0
  - nbconvert==7.7.2=pypi_0
  - httpcore==0.17.3=pypi_0
  - jupyter-console==6.6.3=pypi_0
  - jupyter-server-terminals==0.4.4=pypi_0
  - cupy-cuda117==10.6.0=pypi_0
  - qtconsole==5.4.3=pypi_0
  - quant-cuda==0.0.0=pypi_0
  - contourpy==1.1.0=pypi_0
  - yarl==1.9.2=pypi_0
  - setproctitle==1.3.2=pypi_0
  - pathtools==0.1.2=pypi_0
  - oss2==2.17.0=pypi_0
  - deepdiff==6.3.1=pypi_0
  - comm==0.1.3=pypi_0
  - coverage==7.3.0=pypi_0
  - imageio==2.31.1=pypi_0
  - cymem==2.0.7=pypi_0
  - json5==0.9.14=pypi_0
  - jupyter-client==8.3.0=pypi_0
  - keras==2.13.1=pypi_0
  - markdown-it-py==2.2.0=pypi_0
  - einops-exts==0.0.4=pypi_0
  - outdated==0.2.2=pypi_0
  - markupsafe==2.1.3=pypi_0
  - widgetsnbextension==4.0.8=pypi_0
  - pyarrow==12.0.1=pypi_0
  - addict==2.4.0=pypi_0
  - flatbuffers==23.5.26=pypi_0
  - platformdirs==3.10.0=pypi_0
  - prompt-toolkit==3.0.39=pypi_0
  - shortuuid==1.0.11=pypi_0
  - openxlab==0.0.15=pypi_0
  - bleach==6.0.0=pypi_0
  - pyproject-api==1.5.4=pypi_0
  - smmap==5.0.0=pypi_0
  - munkres==1.1.4=pypi_0
  - pyflakes==2.1.1=pypi_0
  - etils==1.3.0=pypi_0
  - anyio==3.7.1=pypi_0
  - dassl==0.6.3=dev_0
  - huggingface-hub==0.16.4=pypi_0
  - thinc==8.1.10=pypi_0
  - typer==0.9.0=pypi_0
  - httpx==0.24.0=pypi_0
  - zstandard==0.21.0=pypi_0
  - nh3==0.2.14=pypi_0
  - jupyterlab-widgets==3.0.8=pypi_0
  - timm==0.5.4=pypi_0
  - accelerate==0.21.0=pypi_0
  - tensorflow-metadata==1.13.1=pypi_0
  - nltk==3.8.1=pypi_0
  - pyparsing==3.0.9=pypi_0
  - texttable==1.6.7=pypi_0
  - openmim==0.3.9=pypi_0
  - opencv-python==4.8.0.74=pypi_0
  - six==1.16.0=pypi_0
  - spacy-alignments==0.9.0=pypi_0
  - spacy==3.6.0=pypi_0
  - spacy-loggers==1.0.4=pypi_0
  - langcodes==3.3.0=pypi_0
  - safetensors==0.3.1=pypi_0
  - wavedrom==2.0.3.post3=pypi_0
  - terminado==0.17.1=pypi_0
  - pure-eval==0.2.2=pypi_0
  - argon2-cffi==21.3.0=pypi_0
  - ninja==1.11.1=pypi_0
  - pycountry==22.3.5=pypi_0
  - overrides==7.3.1=pypi_0
  - hjson==3.1.0=pypi_0
  - nvidia-cuda-cupti-cu11==11.7.101=pypi_0
  - uvicorn==0.23.1=pypi_0
  - virtualenv==20.24.3=pypi_0
  - python-multipart==0.0.6=pypi_0
  - arrow==1.2.3=pypi_0
  - wcwidth==0.2.6=pypi_0
  - typing-inspect==0.9.0=pypi_0
  - trax==1.4.1=pypi_0
  - gdown==4.7.1=pypi_0
  - websockets==11.0.3=pypi_0
  - nbformat==5.9.1=pypi_0
  - onnx==1.14.0=pypi_0
  - astunparse==1.6.3=pypi_0
  - datasets==2.14.4=pypi_0
  - en-core-web-md==3.6.0=pypi_0
  - decorator==5.1.1=pypi_0
  - llava==1.0.0=pypi_0
  - tensorflow==2.13.0=pypi_0
  - pyre-extensions==0.0.29=pypi_0
  - tensorflow-hub==0.14.0=pypi_0
  - xtcocotools==1.13=pypi_0
  - nvidia-cuda-nvrtc-cu11==11.7.99=pypi_0
  - networkx==3.1=pypi_0
  - absl-py==1.4.0=pypi_0
  - kornia==0.6.4=pypi_0
  - gradio-client==0.2.10=pypi_0
  - pycryptodome==3.18.0=pypi_0
  - crcmod==1.7=pypi_0
  - scikit-learn==1.2.2=pypi_0
  - beautifulsoup4==4.12.2=pypi_0
  - toolz==0.12.0=pypi_0
  - dm-tree==0.1.8=pypi_0
  - pluggy==1.2.0=pypi_0
  - starlette==0.27.0=pypi_0
  - lit==16.0.6=pypi_0
  - debugpy==1.6.7=pypi_0
  - srsly==2.4.7=pypi_0
  - tcolorpy==0.1.3=pypi_0
  - en-core-web-trf==3.6.1=pypi_0
  - fsspec==2023.6.0=pypi_0
  - mmpose==0.24.0=dev_0
  - nvidia-nccl-cu11==2.14.3=pypi_0
  - flake8==3.7.9=pypi_0
  - jupyter==1.0.0=pypi_0
  - pycocoevalcap==1.2=pypi_0
  - torch==2.0.1+cu117=pypi_0
  - appdirs==1.4.4=pypi_0
  - click==8.1.6=pypi_0
  - libclang==16.0.6=pypi_0
  - attributedict==0.3.0=pypi_0
  - kiwisolver==1.4.4=pypi_0
  - pycodestyle==2.5.0=pypi_0
  - fschat==0.2.24=pypi_0
  - ipywidgets==8.0.7=pypi_0
  - requests==2.28.2=pypi_0
  - vllm==0.1.3=pypi_0
  - rouge-score==0.1.2=pypi_0
  - opencv-python-headless==4.8.0.74=pypi_0
  - jupyter-server==2.7.0=pypi_0
  - chumpy==0.70=pypi_0
  - littleutils==0.2.2=pypi_0
  - fastrlock==0.8.2=pypi_0
  - argon2-cffi-bindings==21.2.0=pypi_0
  - rfc3986-validator==0.1.1=pypi_0
  - ffmpy==0.3.1=pypi_0
  - numexpr==2.8.5=pypi_0
  - protobuf==4.23.4=pypi_0
  - defusedxml==0.7.1=pypi_0
  - preshed==3.0.8=pypi_0
  - blessings==1.7=pypi_0
  - pydantic==1.10.11=pypi_0
  - nvidia-curand-cu11==10.2.10.91=pypi_0
  - tqdm-multiprocess==0.0.11=pypi_0
  - triton==2.0.0=pypi_0
  - ml-dtypes==0.2.0=pypi_0
  - orjson==3.9.2=pypi_0
  - threadpoolctl==3.2.0=pypi_0
  - nvidia-nvtx-cu11==11.7.91=pypi_0
  - wandb==0.15.5=pypi_0
  - rouge==1.0.1=pypi_0
  - markdown2==2.4.9=pypi_0
  - pyyaml==6.0=pypi_0
  - jsonschema==4.18.4=pypi_0
  - certifi==2023.5.7=pypi_0
  - google-pasta==0.2.0=pypi_0
  - matplotlib-inline==0.1.6=pypi_0
  - detectron2==0.6=dev_0
  - h11==0.14.0=pypi_0
  - pandocfilters==1.5.0=pypi_0
  - gast==0.4.0=pypi_0
  - webencodings==0.5.1=pypi_0
  - matplotlib==3.7.2=pypi_0
  - nvidia-cufft-cu11==10.9.0.58=pypi_0
  - sentencepiece==0.1.99=pypi_0
  - sacrebleu==1.5.0=pypi_0
  - funcsigs==1.0.2=pypi_0
  - backcall==0.2.0=pypi_0
  - nvidia-cudnn-cu11==8.5.0.96=pypi_0
  - spacy-transformers==1.2.5=pypi_0
  - sqlitedict==2.1.0=pypi_0
  - googleapis-common-protos==1.59.1=pypi_0
  - jinja2==3.1.2=pypi_0
  - jax==0.4.13=pypi_0
  - docker-pycreds==0.4.0=pypi_0
  - python-json-logger==2.0.7=pypi_0
  - fire==0.5.0=pypi_0
  - nvidia-cuda-runtime-cu11==11.7.99=pypi_0
  - semantic-version==2.10.0=pypi_0
  - promise==2.3=pypi_0
  - referencing==0.30.0=pypi_0
  - uri-template==1.3.0=pypi_0
  - asttokens==2.2.1=pypi_0
  - importlib-metadata==6.8.0=pypi_0
  - gitpython==3.1.32=pypi_0
  - fonttools==4.41.0=pypi_0
  - ipython-genutils==0.2.0=pypi_0
  - tifffile==2023.8.12=pypi_0
  - aiohttp==3.8.4=pypi_0
  - sentry-sdk==1.28.1=pypi_0
  - uc-micro-py==1.0.2=pypi_0
  - stack-data==0.6.2=pypi_0
  - transformers==4.33.2=pypi_0
  - nvidia-cusolver-cu11==11.4.0.1=pypi_0
  - cmake==3.26.4=pypi_0
  - regex==2023.6.3=pypi_0
  - enchant==0.0.1=pypi_0
  - nvidia-cusparse-cu11==11.7.4.91=pypi_0
  - tokenizers==0.13.3=pypi_0
  - gym==0.26.2=pypi_0
  - tzdata==2023.3=pypi_0
  - fairscale==0.4.4=pypi_0
  - mistune==3.0.1=pypi_0
  - cryptography==41.0.3=pypi_0
  - parso==0.8.3=pypi_0
  - gitdb==4.0.10=pypi_0
  - pillow==9.5.0=pypi_0
  - wrapt==1.15.0=pypi_0
  - rfc3339-validator==0.1.4=pypi_0
  - humanfriendly==10.0=pypi_0
  - prometheus-client==0.17.1=pypi_0
  - frozenlist==1.4.0=pypi_0
  - opt-einsum==3.3.0=pypi_0
  - pytablewriter==1.0.0=pypi_0
  - fastjsonschema==2.18.0=pypi_0
  - confection==0.1.0=pypi_0
  - dill==0.3.7=pypi_0
  - nbclient==0.8.0=pypi_0
  - pathy==0.10.2=pypi_0
  - mpmath==1.3.0=pypi_0
  - isoduration==20.11.0=pypi_0
  - psutil==5.9.5=pypi_0
  - en-core-web-sm==3.6.0=pypi_0
  - entrypoints==0.3=pypi_0
  - aliyun-python-sdk-core==2.13.36=pypi_0
  - jupyter-core==5.3.1=pypi_0
  - pyzmq==25.1.0=pypi_0
  - annotated-types==0.5.0=pypi_0
  - colour-runner==0.1.1=pypi_0
  - tiktoken==0.3.3=pypi_0
  - flash-attn==1.0.7=pypi_0
  - altair==5.0.1=pypi_0
  - ipykernel==6.24.0=pypi_0
  - segment-anything==1.0=dev_0
  - ray==2.6.3=pypi_0
  - ordered-set==4.1.0=pypi_0
  - scikit-image==0.21.0=pypi_0
  - yapf==0.40.1=pypi_0
  - sympy==1.12=pypi_0
  - notebook==7.0.0=pypi_0
  - tinycss2==1.2.1=pypi_0
  - cycler==0.11.0=pypi_0
  - lm-eval==0.3.0=pypi_0
  - jupyterlab==4.0.3=pypi_0
  - idna==3.4=pypi_0
  - lazy-loader==0.3=pypi_0
  - inspecta==0.1.3=pypi_0
  - lmdb==1.4.1=pypi_0
  - openai==0.27.8=pypi_0
  - send2trash==1.8.2=pypi_0
  - colorama==0.4.6=pypi_0
  - jedi==0.18.2=pypi_0
  - jaxlib==0.4.13=pypi_0
  - wilds==1.2.2=pypi_0
  - numba==0.57.1=pypi_0
  - py-cpuinfo==9.0.0=pypi_0
  - auto-gptq==0.4.1+cu117=pypi_0
  - catalogue==2.0.9=pypi_0
  - rpds-py==0.9.2=pypi_0
  - python-dateutil==2.8.2=pypi_0
  - multidict==6.0.4=pypi_0
  - tabledata==1.3.1=pypi_0
  - notebook-shim==0.2.3=pypi_0
  - pandas==2.0.3=pypi_0
  - webcolors==1.13=pypi_0
  - smart-open==6.3.0=pypi_0
  - pydub==0.25.1=pypi_0
  - pickleshare==0.7.5=pypi_0
  - coloredlogs==15.0.1=pypi_0
  - h5py==3.9.0=pypi_0
  - traitlets==5.9.0=pypi_0
  - mccabe==0.6.1=pypi_0
  - nvidia-cublas-cu11==11.10.3.66=pypi_0
  - shapely==2.0.1=pypi_0
  - linkify-it-py==2.0.2=pypi_0
  - xxhash==3.3.0=pypi_0
  - blis==0.7.10=pypi_0
  - opendatalab==0.0.10=pypi_0
  - jsonlines==3.1.0=pypi_0
  - json-tricks==3.17.2=pypi_0
  - qtpy==2.3.1=pypi_0
  - murmurhash==1.0.9=pypi_0
  - grpcio==1.56.0=pypi_0
  - svgwrite==1.4.3=pypi_0
  - zipp==3.16.2=pypi_0
  - aiofiles==23.1.0=pypi_0
  - pathvalidate==3.1.0=pypi_0
  - spacy-legacy==3.0.12=pypi_0
  - tensorflow-io-gcs-filesystem==0.32.0=pypi_0
  - gin-config==0.5.0=pypi_0
  - msgpack==1.0.5=pypi_0
  - ogb==1.3.6=pypi_0
  - awq-inference-engine==0.0.0=pypi_0
  - nest-asyncio==1.5.6=pypi_0
  - tensorflow-datasets==4.9.2=pypi_0
  - tomli==2.0.1=pypi_0
  - deepspeed==0.9.5=pypi_0
  - tb-nightly==2.15.0a20230816=pypi_0
  - jupyterlab-server==2.24.0=pypi_0
  - sacremoses==0.0.53=pypi_0
  - tensorflow-estimator==2.13.0=pypi_0
  - dataproperty==1.0.1=pypi_0
  - filelock==3.12.2=pypi_0
  - rootpath==0.1.1=pypi_0
  - jmespath==0.10.0=pypi_0
  - tensorflow-text==2.13.0=pypi_0
  - jupyterlab-pygments==0.2.2=pypi_0
  - pygments==2.15.1=pypi_0
  - soupsieve==2.4.1=pypi_0
  - gradio==3.35.2=pypi_0
  - pywavelets==1.4.1=pypi_0
  - termcolor==2.3.0=pypi_0
  - ftfy==6.1.1=pypi_0
  - charset-normalizer==3.2.0=pypi_0
  - llvmlite==0.40.1=pypi_0
  - gym-notices==0.0.8=pypi_0
  - pexpect==4.8.0=pypi_0
  - bitsandbytes==0.42.0=pypi_0
  - cython==0.29.36=pypi_0
  - mbstrdecoder==1.1.3=pypi_0
  - model-index==0.1.11=pypi_0
  - einops==0.6.1=pypi_0
  - jsonschema-specifications==2023.7.1=pypi_0
  - mdurl==0.1.2=pypi_0
  - xformers==0.0.20=pypi_0
  - tornado==6.3.2=pypi_0
  - babel==2.12.1=pypi_0
  - ptyprocess==0.7.0=pypi_0
  - pydantic-core==2.3.0=pypi_0
  - rich==13.4.2=pypi_0
  - packaging==23.1=pypi_0
  - mmengine==0.8.2=pypi_0
  - setuptools==60.2.0=pypi_0
  - tqdm==4.66.1=pypi_0
  - joblib==1.3.1=pypi_0
  - tox==4.9.0=pypi_0
  - distlib==0.3.7=pypi_0
  - executing==1.2.0=pypi_0
  - attrs==23.1.0=pypi_0
  - mdit-py-plugins==0.3.3=pypi_0
  - wasabi==1.1.2=pypi_0
  - sniffio==1.3.0=pypi_0
  - black==22.3.0=pypi_0
  - fqdn==1.5.1=pypi_0
  - more-itertools==9.1.0=pypi_0
  - typing-extensions==4.7.1=pypi_0
  - array-record==0.4.0=pypi_0
  - urllib3==2.0.3=pypi_0
  - jupyter-lsp==2.2.0=pypi_0

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

I think it's a channels issue, as a normal conda environment yml file has a section defining the channels for the packages. I have also tried adding conda-forge as a channel via conda config --add channels conda-forge, but I still get the same error.

Your help in reproducing the environments for the annotation pipeline of the dataset is much appreciated.
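A note on the error above: in a conda export of this kind, a build string of pypi_0 marks a package that was installed with pip inside the environment, so conda cannot find it in any channel, whichever channels are configured. Entries with a dev_0 build appear to be local development installs and likely need to be built from source. One workaround, sketched here under those assumptions, is to split the file into conda specs and pip requirements. The function split_export and the sample lines are illustrative only, and this is not the maintainers' recommended procedure.

```python
import re

# Accepts both "name=version=build" and "name==version=build".
SPEC = re.compile(
    r"^(?P<name>[^=\s]+)={1,2}(?P<version>[^=\s]+)(?:=(?P<build>\S+))?$"
)


def split_export(lines):
    """Split export lines into conda specs, pip requirements, local builds."""
    conda_specs, pip_specs, local_builds = [], [], []
    for raw in lines:
        # Drop list markers such as "  - " copied from the error message.
        line = raw.strip().lstrip("-").strip()
        if not line or line.startswith("#"):
            continue
        match = SPEC.match(line)
        if match is None:
            continue
        name, version = match["name"], match["version"]
        build = match["build"] or ""
        if build.startswith("pypi"):
            pip_specs.append(f"{name}=={version}")   # installable with pip
        elif build.startswith("dev"):
            local_builds.append(name)                # needs a source install
        else:
            conda_specs.append(line)                 # leave for conda
    return conda_specs, pip_specs, local_builds


sample = [
    "# platform: linux-64",
    "  - numpy==1.24.3=pypi_0",
    "  - torch==2.0.1+cu117=pypi_0",
    "  - mmdet==2.25.3=dev_0",
    "ca-certificates=2023.05.30=h06a4308_",
]
conda_specs, pip_specs, local_builds = split_export(sample)
print(pip_specs)
```

The pip list could then be written to a requirements file and installed after creating the conda environment from the remaining specs. Versions carrying a local tag such as +cu117 typically also need the matching extra package index to be given to pip.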

Pretraining instructions

First of all, thank you very much for sharing your amazing work.

I might have missed it, but it looks like there are currently no instructions on how to pretrain a model from scratch. Are you planning to share this too?

About GranD Pre-training Dataset

Hello GLaMM Team,

Thank you very much for sharing this fascinating work!

It seems that you have been incrementally uploading the pre-training dataset, GranD, to https://huggingface.co/datasets/MBZUAI/GranD. Just a few clarification questions:

Does the whole dataset use all 11M images from SA-1B?
Any estimation when the upload will be completed?

Thanks,
Shengcao

FlashAttention

Hi,

Thank you for the codebase and the models! I notice that flash attention is one of the project's dependencies. I'm working on AMD GPUs, and installing flash attention with ROCm support is currently rather challenging, so I was wondering whether the code uses flash attention only for training or also for inference. If it's used only for training, I might skip installing it, as I want to use GLaMM mostly for inference. I'm asking because another GitHub repo I've been looking at suggests installing flash attention only if training is required. Thank you!

the demo caption is very simple, reproduce the result in paper

The demo caption is very simple, unlike the detailed one in the paper. Did you limit the maximum output length?

How to download the images, including saiapr_tc-12, under the Refer_Segm folder?

Question on reproducing the evaluation/demo performance from pretrained models

Hello, first of all thanks for the great work!

I've been trying to reproduce the evaluation and the demo, but they do not produce results of the same quality as shown in the official materials.

Environment

CUDA 12.1, PyTorch 2.1.2+cu121
A100 with 40 GB memory
Ubuntu 20.04
Followed the installation doc; running on version ba4f2b6

Demo
I set up the gradio environment, run with python app.py --version='GLaMM-FullScope', and tried a few examples listed on the page, but the quality is bad (as shown below). No luck if I change 'GLaMM-FullScope' to other models.

Evaluation
I downloaded COCO train 2014 and refCOCO series, and executed bash eval/referring_seg/run_evaluation.sh 'MBZUAI/GLaMM-RefSeg' './results_refseg_finetuned'.

In my initial attempt in numerical evaluation, the code produces an error from the assert here. I looked into it and found cur_len is always total_len + 2. Without knowledge on how to fix it, I had to comment it out in order to run the script.

Here're the results I have obtained (not finished on every test set, but the performance is obviously bad):

[{'model': './results_refseg_finetuned', 'dataset': 'refcoco|val', 'giou': '0.050665364', 'ciou': '0.1175718'}, {'model': './results_refseg_finetuned', 'dataset': 'refcoco|val', 'giou': '0.049141906', 'ciou': '0.11084828'}, {'model': './results_refseg_finetuned', 'dataset': 'refcoco|testA', 'giou': '0.06668462', 'ciou': '0.15553774'}, {'model': './results_refseg_finetuned', 'dataset': 'refcoco|testB', 'giou': '0.034645554', 'ciou': '0.094948955'}, {'model': './results_refseg_finetuned', 'dataset': 'refcoco+|val', 'giou': '0.0536244', 'ciou': '0.11712953'}, {'model': './results_refseg_finetuned', 'dataset': 'refcoco+|testA', 'giou': '0.075434625', 'ciou': '0.15804651'}, {'model': './results_refseg_finetuned', 'dataset': 'refcoco+|testB', 'giou': '0.04812723', 'ciou': '0.110266894'}]

Any help or clue to resolve the performance issue is appreciated, thanks!

no code no model

mmcv version

Hi,

Currently the code, in particular mmdet, supports mmcv versions only up to 1.5.0. I tried versions from 1.4.7 (which you suggest) up to 1.5.0, but as I'm on AMD GPUs, mmcv won't install due to bugs in the ROCm support. Later versions address the issue: I successfully installed mmcv 2.1.0 with ROCm support on AMD GPUs. However, because mmdet currently accepts only mmcv versions up to 1.5.0, I cannot properly use the code.

Is there any chance that you could update the code so that mmdet can support mmcv 2.1.0? Thank you.

For region level captioning, does the model support multi-region inputs?

When I read the paper, I found that the model can handle multiple regions as input. Does this mean that, given an image and multiple boxes as input, the model can generate all the region descriptions at once? When I look at the code, however, it seems that the model can only caption one box (region) at a time. If I need to generate captions for all the regions in an image, does that mean I have to run inference several times? Looking forward to your reply.

Issue with ngrok Error (ERR_NGROK_8012) on GLaMM Demo Page

Hello,

I am encountering an issue while trying to access the GLaMM Demo Page. The error message I received is as follows:

I tried refreshing the page as suggested in the error message, but the issue persists.

Thank you for your assistance.

Best regards,
w228h

Internal error from sentencepiece

Hi,

I have successfully installed the offline demo as instructed here

While running the command :

 python app.py --version ./GLaMM-FullScope

I am getting following error:

Traceback (most recent call last):
  File "/mnt/winD/ML/gitProjs/LLVM/groundingLMM/app.py", line 271, in <module>
    tokenizer = setup_tokenizer_and_special_tokens(args)
  File "/mnt/winD/ML/gitProjs/LLVM/groundingLMM/app.py", line 42, in setup_tokenizer_and_special_tokens
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/srikrishna/Install/conda/conda/envs/glamm/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 682, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/srikrishna/Install/conda/conda/envs/glamm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1805, in from_pretrained
    return cls._from_pretrained(
  File "/home/srikrishna/Install/conda/conda/envs/glamm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1959, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/srikrishna/Install/conda/conda/envs/glamm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 71, in __init__
    self.sp_model.Load(vocab_file)
  File "/home/srikrishna/Install/conda/conda/envs/glamm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/home/srikrishna/Install/conda/conda/envs/glamm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

Thanks in advance for your time.

Online Demo Down

Hello Authors of GLaMM,

Thank you for sharing your excellent work!

This is just a notification that your online demo (https://glamm.mbzuai-oryx.ngrok.app) is not working now. Would it be possible to fix it?

Best,
Shengcao

how the relationships are formed using objects from level-1?

As mentioned in the title, Section 4.2, "Relationships and Landmarks," presents some points that may cause confusion:

1) Could you clarify how relationships are established using objects from Level 1?
2) What was the rationale behind introducing the landmark category at this stage? Were there other considerations involved?

=====================================================
I also wonder whether the relationships are derived from short captions generated by the LLM.

Empty output when inferring on the example image.

I used the GLaMM-FullScope model to perform inference on a sample image and received a peculiar output. I've verified the versions of the relevant installed libraries, and they align with the specified requirements. How can I address this problem?

