[HuggingFace Space] | [COLAB] | [Demo Video]
Meta released the Segment Anything Model (SAM), a new foundation model for segmentation tasks. It aims to solve downstream segmentation tasks through prompt engineering, with prompts such as foreground/background points, bounding boxes, masks, and free-form text. However, text-prompt support has not been released yet.
As an alternative, I took the following steps:
- Get all object proposals generated by SAM (Segment Anything Model).
- Crop each object region using its bounding box.
- Get the features of each cropped image and of the text query from CLIP.
- Calculate the similarity between image features and the query feature.
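The cropping and filtering steps can be sketched without loading either model. This is a minimal illustration, not this repository's code: it assumes each proposal is a dict whose `bbox` is in `[x, y, width, height]` format (the format SAM's automatic mask generator returns). The helper names `crop_by_bbox` and `filter_by_similarity`, the nested-list image, the placeholder scores, and the 0.9 threshold are all hypothetical.

```python
from typing import Dict, List, Sequence


def crop_by_bbox(image: Sequence[Sequence[int]], bbox: Sequence[float]) -> List[List[int]]:
    """Crop a row-major image (a list of rows) by an [x, y, w, h] box."""
    x, y, w, h = (int(v) for v in bbox)
    return [list(row[x:x + w]) for row in image[y:y + h]]


def filter_by_similarity(
    proposals: Sequence[Dict], similarities: Sequence[float], threshold: float = 0.9
) -> List[Dict]:
    """Keep the proposals whose similarity to the query exceeds the threshold."""
    return [p for p, s in zip(proposals, similarities) if s > threshold]


# Toy 4x4 "image" whose pixel values are 0..15 (stands in for a real image array).
image = [[r * 4 + c for c in range(4)] for r in range(4)]

# Hypothetical proposals; only the `bbox` field is used here.
proposals = [{"bbox": [0, 0, 2, 2]}, {"bbox": [2, 1, 2, 3]}]
crops = [crop_by_bbox(image, p["bbox"]) for p in proposals]

# Placeholder scores standing in for the CLIP similarities (assumed values).
similarities = [0.95, 0.10]
kept = filter_by_similarity(proposals, similarities)
```

With a real image array, the same `[y:y + h, x:x + w]` slicing applies; only the container type changes.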
# How to get the similarity.
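# `crop` is one cropped region as a PIL image; `texts` holds the text queries.
preprocessed_img = preprocess(crop).unsqueeze(0)  # add a batch dimension
tokens = clip.tokenize(texts)
logits_per_image, _ = model(preprocessed_img, tokens)
similarity = logits_per_image.softmax(-1)  # normalize across the text queries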
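For intuition, the softmax over `logits_per_image` can be reproduced in plain Python. The sketch below assumes, following CLIP's published model code, that the logits are cosine similarities between L2-normalized image and text features multiplied by a learned scale. The scale value 100.0, the toy 3-dimensional features, and the function names are illustrative assumptions, not values taken from this repository.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def clip_style_similarity(
    image_feature: Sequence[float],
    text_features: Sequence[Sequence[float]],
    logit_scale: float = 100.0,  # assumed scale, for illustration only
) -> List[float]:
    """Softmax over scaled cosine similarities between one image and each text."""
    img = normalize(image_feature)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, normalize(t)))
        for t in text_features
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy features: the image points mostly along the first text direction.
image_feature = [0.9, 0.1, 0.0]
text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
similarity = clip_style_similarity(image_feature, text_features)
```

Because the softmax normalizes across the queries, a single query always scores 1.0, so the score is only informative when at least one contrasting query is supplied.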
Anaconda is required before starting setup.
make env
conda activate segment-anything-with-clip
make setup
# This starts the Gradio server.
make run