
tokenize-anything's Introduction

Tokenize Anything via Prompting

Ting Pan1,2*,   Lulu Tang2*,   Xinlong Wang,   Shiguang Shan1

1ICT-CAS,   2BAAI
* Equal Contribution, Project Lead

[Paper] [🤗 Demo]

We present Tokenize Anything via Prompting, a unified and promptable model capable of simultaneously segmenting, recognizing, and captioning arbitrary regions, with flexible visual prompts (point, box and sketch). The model is trained with exhaustive segmentation masks sourced from SA-1B, coupled with semantic priors from a pre-trained EVA-CLIP with 5 billion parameters.

Installation

Preliminaries

torch >= 2.1

flash-attn >= 2.3.3 (for TextGeneration)

gradio-image-prompter (for GradioApp, Install from URL)

Installing Package

Clone this repository to local disk and install:

cd tokenize-anything && pip install .

You can also install from the remote repository:

pip install git+ssh://git@github.com/baaivision/tokenize-anything.git

Quick Start

Development

The TAP models can be used for diverse vision and language tasks.

We adopt a modular design that decouples all components and predictors.

As a best practice, implement your custom predictor and asynchronous pipeline as follows:

from tokenize_anything import model_registry

with <distributed_actor>:
    model = model_registry["<model_type>"](checkpoint="<path/to/checkpoint>")
    results = <custom_predictor>(model, *args, **kwargs)

server.collect_results()
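
For reference, here is a concrete single-process sketch of this pattern (no distributed actor). It assumes a locally downloaded checkpoint; the predictor body, file name, and variable names are illustrative, the inputs dictionary must be prepared as described in the Inference Guide, and concept prediction additionally requires the concept weights from the Concept Guide.

from tokenize_anything import model_registry

def my_predictor(model, inputs):
    # Illustrative predictor: `inputs` is assumed to already hold the
    # preprocessed image features and prompts expected by the model.
    outputs = model.get_outputs(inputs)
    sem_embeds = outputs["sem_embeds"]
    concepts, scores = model.predict_concept(sem_embeds)
    return concepts, scores

if __name__ == "__main__":
    # Checkpoint path is illustrative; see the Models section below.
    model = model_registry["tap_vit_l"](checkpoint="weights/tap_vit_l_c1d41f.pkl")
    # results = my_predictor(model, inputs)  # build `inputs` per the Inference Guide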

See the built-in examples (web demo and evaluations) provided in scripts for more details.

Inference

See Inference Guide.

See Concept Guide.

Evaluation

See Evaluation Guide for TAP-H.

See Evaluation Guide for TAP-L.

See Evaluation Guide for TAP-B.

Models

Model weights

V1.1 Release Notes

  • Three versions of the model are available with different image encoders.
  • Use a longer pre-training and fine-tuning schedule (improved segmentation and caption performance).
  • Apply weight decay for all bias parameters (avoid FP16 overflow in QK matmul).
  • Sample point prompts from predicted mask instead of GT box during VG training.
| Model | Description | Schedule | MD5 | Weights |
| --- | --- | --- | --- | --- |
| tap_vit_h | ViT-H TAP v1.1 model | (100% SA-1B, 180k), (VG, 50ep) | 4bdfb9 | 🤗 HF link |
| tap_vit_l | ViT-L TAP v1.1 model | (100% SA-1B, 180k), (VG, 50ep) | c1d41f | 🤗 HF link |
| tap_vit_b | ViT-B TAP v1.1 model | (100% SA-1B, 180k), (VG, 50ep) | 707f80 | 🤗 HF link |

V1.0 Release Notes

  • Two versions of the model are available with different image encoders.
  • Original paper results.
| Model | Description | Schedule | MD5 | Weights |
| --- | --- | --- | --- | --- |
| tap_vit_l | ViT-L TAP v1.0 model | (50% SA-1B, 90k), (VG, 25ep) | 03f8ec | 🤗 HF link |
| tap_vit_b | ViT-B TAP v1.0 model | (50% SA-1B, 90k), (VG, 25ep) | b45cbf | 🤗 HF link |
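
The released checkpoint filenames carry these MD5 prefixes (e.g., tap_vit_b_b45cbf.pkl). A quick way to verify a download, sketched with only the Python standard library (the path below is illustrative):

import hashlib

def md5_prefix(path, n=6):
    """Return the first n hex digits of a file's MD5 checksum."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()[:n]

if __name__ == "__main__":
    # Expect "b45cbf" for the v1.0 ViT-B checkpoint (path is illustrative).
    print(md5_prefix("weights/tap_vit_b_b45cbf.pkl"))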

Concept weights

Note: You can generate these weights following the Concept Guide.

| Concept | Description | Weights |
| --- | --- | --- |
| Merged-2560 | Merged concepts | 🤗 HF link |
| LVIS-1203 | LVIS concepts | 🤗 HF link |
| COCO-80 | COCO concepts | 🤗 HF link |

License

Apache License 2.0

Citation

@article{pan2023tap,
  title={Tokenize Anything via Prompting},
  author={Pan, Ting and Tang, Lulu and Wang, Xinlong and Shan, Shiguang},
  journal={arXiv preprint arXiv:2312.09128},
  year={2023}
}

Acknowledgement

We thank the repositories: SAM, EVA, LLaMA, FlashAttention, Gradio, Detectron2 and CodeWithGPU.


tokenize-anything's Issues

Question about experiment setting.

Awesome work, Congratulations!
I have some questions about the experiment setting.

  1. In zero-shot instance segmentation, you still use the ViTDet classification results. However, the TAP model can generate semantic tokens and perform classification itself. Have you tried treating ViTDet as a pure object-proposal network and using TAP's classification results for this task?
  2. In zero-shot instance classification, cropping + CLIP makes a strong baseline. I have tried this before but could not reach your AP. In my implementation, I center-crop a square region scaled up by 1.5x. Are there any other tricks to improve the classification accuracy?
  3. Similar to 2, does your data-annotation process use any special tricks when cropping images for CLIP? Did you use the SA ground-truth masks for background blurring or something else? (I see in Fig. 7 that you paste the masked object onto a solid-color background. I suspect this removes a lot of the context CLIP needs to produce a correct image feature. Does this mean a little fine-tuning of the CLIP model (like MaskAdaptedCLIP or Alpha-CLIP) could extract better features for knowledge distillation?)

Question on Heuristic Routing Strategy

I've been exploring the implementation of the heuristic routing strategy within your project and came across a specific operation that piqued my curiosity. Specifically, I noticed that the strategy doesn't directly utilize the first bounding box (IOU score index 0) prediction result for routing decisions. Instead, there seems to be an operation where the initial mask prediction result is modified by subtracting 1000 from it.

Could you please clarify the underlying principle behind this approach? I'm particularly interested in understanding:

  • The rationale for not using the first BBox prediction result as-is for heuristic routing.

  • The significance and expected impact of subtracting 1000 from the initial mask prediction result.

I believe understanding this could greatly enhance my comprehension of the heuristic routing strategy's design and its implications on the system's overall performance.

Looking forward to your insights.

Thank you for your time and consideration.

Question about CLIP crop baseline

Hi, sorry to bother you, but I still have trouble achieving 40 AP on LVIS with the CLIP baseline. I pad the shorter edge of the input image.
Below are the images before applying the standard CLIP transformation (resize to 224 and normalize).
The CLIP standard transform:

Compose(
    ToTensor()
    Resize(size=(224, 224), interpolation=bicubic, max_size=None, antialias=None)
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)

[input images: annotation ids 1-5 in LVIS_v1_val, left to right]

I use "a photo of a {}" as the text prompt, with [x['name'] for x in lvis.cats.values()] as the class names, but I can only get 25.4 AP using the standard LVIS API.

Is there anything important I am missing? Or would it be possible to share your code for the CLIP baseline?

By the way, I found an ICLR-24 submission with a CLIP baseline whose overall AP is similar to yours but whose APr, APc, and APf results follow the opposite pattern.
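
For reference, here is a minimal sketch of the crop-then-CLIP baseline described in this issue (not the authors' code). It assumes the openai/CLIP package and the 1.5x square crop around the box center mentioned above; all names and values are illustrative.

import clip
import torch
import torch.nn.functional as F
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

@torch.no_grad()
def classify_crop(image, box, class_names, scale=1.5):
    """Classify a boxed region by cropping a scaled square around it and scoring with CLIP."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half = max(x1 - x0, y1 - y0) * scale / 2
    crop = image.crop((int(cx - half), int(cy - half), int(cx + half), int(cy + half)))
    image_feat = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
    text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_feat = model.encode_text(text)
    sims = F.cosine_similarity(image_feat, text_feat)  # one score per class
    return class_names[sims.argmax().item()]

# e.g., classify_crop(Image.open("example.jpg"), (163, 237, 682, 722), ["cat", "dog"])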

TypeError: 'NoneType' object is not callable

Thanks for your contribution. I ran into some problems with the Inference Guide: I get the same output as In [5], (1024, 964, 3) <- (1264, 1190) * (0.810126582278481, 0.810126582278481), but In [7] fails.
The error is reported as follows:
Traceback (most recent call last):
File "f:\tokenize\notebooks\tap.py", line 65, in
outputs = model.get_outputs(inputs)
File "D:\anaconda3\envs\tokenize\lib\site-packages\tokenize_anything\modeling\image_tokenizer.py", line 103, in get_outputs
return self.image_decoder(inputs)
File "D:\anaconda3\envs\tokenize\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\anaconda3\envs\tokenize\lib\site-packages\tokenize_anything\modeling\image_decoder.py", line 219, in forward
outputs = self.get_outputs(inputs)
File "D:\anaconda3\envs\tokenize\lib\site-packages\tokenize_anything\modeling\image_decoder.py", line 199, in get_outputs
query, key = self.transformer(query, key, query, inputs["img_pos"])
File "D:\anaconda3\envs\tokenize\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\anaconda3\envs\tokenize\lib\site-packages\tokenize_anything\modeling\image_decoder.py", line 138, in forward
query, key = blk(query, key, query_pos, key_pos)
File "D:\anaconda3\envs\tokenize\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\anaconda3\envs\tokenize\lib\site-packages\tokenize_anything\modeling\image_decoder.py", line 96, in forward
query = self.norm1(self.self_attn(query, query, query))
File "D:\anaconda3\envs\tokenize\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\anaconda3\envs\tokenize\lib\site-packages\tokenize_anything\modeling\image_decoder.py", line 66, in forward
o = flash_attn_func(q, k, v, softmax_scale=self.scale)
TypeError: 'NoneType' object is not callable

My environment is as follows:
opencv-python=4.9.0.80
Pillow=10.0.1
gradio-image-prompter=3.23.0
sentencepiece=0.1.99
torch=1.13+cu116
flash-attn=2.4.2
python=3.8.18

TypeError: 'NoneType' object is not callable

Wonderful work!
After configuring the environment and stepping through inference.ipynb, I got an error at Visual Prompt Decoding: Point: TypeError: 'NoneType' object is not callable.
The detailed error information is as follows:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[19], line 21
     19 sem_tokens, sem_embeds = outputs["sem_tokens"], outputs["sem_embeds"]
     20 concepts, scores = model.predict_concept(sem_embeds[mask_index])
---> 21 captions = model.generate_text(sem_tokens[mask_index][:, None, :])
     23 # Display comprehensive visual understanding.
     24 text_contents = [v.flatten()[0] for v in (concepts, iou_scores, scores, captions)]

File ~/anaconda3/envs/sam/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File ~/anaconda3/envs/sam/lib/python3.8/site-packages/tokenize_anything/modeling/image_tokenizer.py:189, in ImageTokenizer.generate_text(self, visual_tokens, max_gen_len, temperature)
    187 decode_seq_len = cur_pos - prev_pos
    188 x = torch.from_numpy(tokens[:, prev_pos:cur_pos]).to(device=prompts.device)
--> 189 logits = self.text_decoder.transformer(prompts, x, prev_pos)
    190 next_logits = logits[: x.size(0), decode_seq_len - 1]
    191 if temperature > 0:

File ~/anaconda3/envs/sam/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
...
---> 73 q = apply_rotary_emb(q, cos, sin, interleaved=True, inplace=inplace)
     74 k = apply_rotary_emb(k, cos, sin, interleaved=True, inplace=inplace)
     75 return q, k

TypeError: 'NoneType' object is not callable

Great work, how to fine-tune

Thanks for your contribution, it is great work. How can I fine-tune on my own dataset, e.g., a mural dataset? Could you provide the command or a script?

How are region-level descriptions obtained?

Thanks for your great work!
In your paper, the label for the semantic classification branch is the mask-crop embedding obtained from CLIP. How, then, is the ground truth for the caption branch generated from SA-1B?

How to do global segmentation?

Thank you for your model! I would like to know how your model performs global segmentation. Could the authors open-source the inference code for global segmentation in the future?

Inference from multiple points

When I change the number of points in Inference.ipynb:
inputs["points"] = np.array([[[1050.1, 900, 1],[0,0,4]],[[900,800,1],[0,0,4]],[[230, 900, 1],[0,0,4]],[[1120, 920, 1],[0,0,4]],[[1030, 900, 1],[0,0,4]],[[112, 93, 1],[0,0,4]],[[1022, 904, 1],[0,0,4]],[[1050,123, 1],[0,0,4]],[[50, 900, 1],[0,0,4]],], "float32")

Something goes wrong; the details are shown as follows:
RuntimeError                              Traceback (most recent call last)
Cell In[7], line 24
22 concepts, scores = model.predict_concept(sem_embeds[mask_index])
23 print(sem_tokens[mask_index][:, None, :].shape)
---> 24 captions = model.generate_text(sem_tokens[mask_index][:, None, :])
25 print(captions)
26 # Display comprehensive visual understanding.

File ~/anaconda3/envs/tap/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File ~/anaconda3/envs/tap/lib/python3.10/site-packages/tokenize_anything/modeling/image_tokenizer.py:189, in ImageTokenizer.generate_text(self, visual_tokens, max_gen_len, temperature)
187 decode_seq_len = cur_pos - prev_pos
188 x = torch.as_tensor(tokens[:, prev_pos:cur_pos], device=prompts.device)
--> 189 logits = self.text_decoder.transformer(prompts, x, prev_pos)
190 next_logits = logits[: x.size(0), decode_seq_len - 1]
191 if temperature > 0:

File ~/anaconda3/envs/tap/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
...
1215 num_splits,
1216 )
1217 return out

RuntimeError: seqlens_k must have shape (batch_size)

Question about Model D training

Awesome work, congratulations!
I have some questions about the Model D training.
1. In this model, you pre-train with [Mask, Concept]. Does "concept" here mean the text embeddings (2560 categories)? If so, how are these concepts assigned to the 1B masks?
2. The paper mentions 2.25 TB of image embeddings. How is this data used?

box prompt format?

Why does Visual Prompt Decoding: Box use
inputs["points"] = np.array([[[163, 237, 2], [682, 722, 3]]], "float32")

What is the box prompt format?
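
From the snippets on this page (this issue and the multi-point issue above), the prompt encoding appears to be a list of (x, y, flag) triples per prompt, where flag 1 marks a foreground point, flags 2 and 3 mark the top-left and bottom-right box corners, and flag 4 marks padding. This is inferred from the examples here rather than an official specification; a small sketch:

import numpy as np

# One point prompt, padded with a (0, 0, 4) placeholder as in the multi-point issue above.
point_prompt = np.array([[[950, 600, 1], [0, 0, 4]]], "float32")

# One box prompt: top-left corner flagged 2, bottom-right corner flagged 3.
box_prompt = np.array([[[163, 237, 2], [682, 722, 3]]], "float32")

# Several prompts batched along the first axis (each padded to the same length).
batched = np.concatenate([point_prompt, box_prompt], axis=0)
print(batched.shape)  # (2, 2, 3)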

An error happens when I update the version of triton

Wonderful work! I installed the flash-attn wheel (version 2.3.5, with CUDA 11.7 and torch 2.0.1) so that flash_attn imports successfully, as you mentioned in issue 2, and I updated triton to 2.1.0. However, something goes wrong when I run scripts/app_gradio.py; the details are shown as follows:

(tapmap) wjh@ps:/data/wjh/LVLM_pkgs/tokenize-anything$ CUDA_VISIBLE_DEVICES=6 python3 scripts/app_gradio.py --model-type tap_vit_b --checkpoint ./weights/tap_vit_b_b45cbf.pkl --concept ./weights/merged_2560.pkl
Running on local URL:  http://127.0.0.1:2030

To create a public link, set `share=True` in `launch()`.
ERROR: ld.so: object '/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS64): ignored.
ERROR: ld.so: object '/lib64/libstdc++.so' from /etc/ld.so.preload cannot be preloaded (wrong ELF class: ELFCLASS64): ignored.
ERROR: ld.so: object '/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS64): ignored.
ERROR: ld.so: object '/lib64/libstdc++.so' from /etc/ld.so.preload cannot be preloaded (wrong ELF class: ELFCLASS64): ignored.
/usr/lib/gcc/x86_64-linux-gnu/7/../../../../x86_64-linux-gnu/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/7/liblto_plugin.so: error loading plugin: /usr/lib/gcc/x86_64-linux-gnu/7/liblto_plugin.so: wrong ELF class: ELFCLASS64
collect2: error: ld returned 1 exit status
Process Process-2:
Traceback (most recent call last):
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/tokenize_anything/engine/test_engine.py", line 81, in run
    self.send_results(predictor, indices, examples)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/tokenize_anything/engine/test_engine.py", line 47, in send_results
    results = predictor.get_results(examples)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data/wjh/LVLM_pkgs/tokenize-anything/scripts/app_gradio.py", line 97, in get_results
    captions = self.model.generate_text(sem_tokens).reshape(batch_shape)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/tokenize_anything/modeling/image_tokenizer.py", line 188, in generate_text
    logits = self.text_decoder.transformer(prompts, x, prev_pos)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/tokenize_anything/modeling/text_decoder.py", line 160, in forward
    x = blk(x)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/tokenize_anything/modeling/text_decoder.py", line 134, in forward
    x = self.dropout(self.attn(self.norm1(x))).add_(x)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/tokenize_anything/modeling/text_decoder.py", line 104, in forward
    q, k = self.cache.forward_rotary(q, k, inplace=True)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/tokenize_anything/modeling/text_decoder.py", line 73, in forward_rotary
    q = apply_rotary_emb(q, cos, sin, interleaved=True, inplace=inplace)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/flash_attn/layers/rotary.py", line 122, in apply_rotary_emb
    return ApplyRotaryEmb.apply(
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/flash_attn/layers/rotary.py", line 48, in forward
    out = apply_rotary(
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/flash_attn/ops/triton/rotary.py", line 213, in apply_rotary
    rotary_kernel[grid](
  File "<string>", line 63, in rotary_kernel
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/triton/compiler/compiler.py", line 425, in compile
    so_path = make_stub(name, signature, constants)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/triton/compiler/make_launcher.py", line 39, in make_stub
    so = _build(name, src_path, tmpdir)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/triton/common/build.py", line 90, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/home/wjh/anaconda3/envs/tapmap/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpmsxq3wqb/main.c', '-O3', '-I/home/wjh/anaconda3/envs/tapmap/lib/python3.10/site-packages/triton/common/../third_party/cuda/include', '-I/home/wjh/anaconda3/envs/tapmap/include/python3.10', '-I/tmp/tmpmsxq3wqb', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpmsxq3wqb/rotary_kernel.cpython-310-x86_64-linux-gnu.so', '-L/usr/lib/x86_64-linux-gnu', '-L/usr/lib/x86_64-linux-gnu']' returned non-zero exit status 1.

Error for FlashAttention

The code reports an error when running on a V100 GPU:
RuntimeError: FlashAttention only supports Ampere GPUs or newer.

Since I only have V100 GPUs, how can I change the code? For example, the line o = self.cache.forward_flash(self, q, k, v).

class Attention(nn.Module):
    """Self-Attention layer."""

    def __init__(self, dim, num_heads, bias=True):
        super(Attention, self).__init__()
        self.qkv = nn.Linear(dim, dim * 3, bias=bias)
        self.proj = nn.Linear(dim, dim, bias=bias)
        self.head_dim = dim // num_heads
        self.num_heads = num_heads
        self.scale = self.head_dim**-0.5
        self.dropout = nn.Dropout(0.1, inplace=False)
        self.cache = nn.Module()

    def forward(self, x):
        qkv_shape = (-1, x.size(1), 3, self.num_heads, self.head_dim)
        q, k, v = self.qkv(x).view(qkv_shape).unbind(dim=2)
        q, k = self.cache.forward_rotary(q, k, inplace=True)
        o = self.cache.forward_flash(self, q, k, v)
        return self.proj(o.flatten(2))


class Attention(nn.Module):
    """Multi-head attention."""

    def __init__(self, dim=256, num_heads=8, attn_ratio=1):
        super(Attention, self).__init__()
        qkv_dim = int(dim * attn_ratio)
        self.num_heads = num_heads
        self.head_dim = qkv_dim // num_heads
        self.q_proj = nn.Linear(dim, qkv_dim)
        self.k_proj = nn.Linear(dim, qkv_dim)
        self.v_proj = nn.Linear(dim, qkv_dim)
        self.proj = nn.Linear(qkv_dim, dim)
        self.scale = self.head_dim**-0.5

    def forward(self, q, k, v):
        q = self.q_proj(q).view((-1, q.size(1), self.num_heads, self.head_dim))
        k = self.k_proj(k).view((-1, k.size(1), self.num_heads, self.head_dim))
        v = self.v_proj(v).view((-1, v.size(1), self.num_heads, self.head_dim))
        o = flash_attn_func(q, k, v, softmax_scale=self.scale)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)[0]
        return self.proj(o.flatten(2))
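
If flash-attn cannot run on your GPU (e.g., pre-Ampere cards such as the V100), one possible workaround is to replace flash_attn_func with PyTorch's scaled_dot_product_attention. The sketch below is not the authors' fix; it only converts between the (batch, seq, heads, dim) layout that flash_attn_func uses and the (batch, heads, seq, dim) layout that SDPA expects (the scale= keyword needs torch >= 2.1).

import torch
import torch.nn.functional as F

def sdpa_attention(q, k, v, softmax_scale=None, causal=False):
    """Fallback with the same input/output layout as flash_attn_func: (batch, seq, heads, dim)."""
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # -> (batch, heads, seq, dim)
    o = F.scaled_dot_product_attention(q, k, v, is_causal=causal, scale=softmax_scale)
    return o.transpose(1, 2)  # -> (batch, seq, heads, dim)

if __name__ == "__main__":
    q = k = v = torch.randn(2, 16, 8, 32)
    print(sdpa_attention(q, k, v, softmax_scale=32 ** -0.5).shape)  # torch.Size([2, 16, 8, 32])

Note that decoding with a rotary/KV cache, as in the text decoder above, needs extra care with is_causal, so treat this only as a starting point.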

Training Code

Hi! This is great work that can train SAM from scratch. I wonder whether the training code will be released for further research, or whether you could provide more training details on the SAM data.

Caption branch

Thanks for your great work!
In your project, the caption branch is trained only on VG data, so its captioning ability may be weaker than models trained with large-scale caption data and a large language model. Do you have plans to train this model with large-scale caption data, or other future work in this direction?
