
unilseg's Introduction

Universal Segmentation at Arbitrary Granularity with Language Instruction

Yong Liu, Cairong Zhang, Yitong Wang, Jiahao Wang, Yujiu Yang, Yansong Tang

This repository contains the official implementation of "Universal Segmentation at Arbitrary Granularity with Language Instruction" [CVPR 2024].

Paper


📖 Abstract

This paper aims to achieve universal segmentation at arbitrary semantic levels. Despite significant progress in recent years, specialist segmentation approaches remain limited to specific tasks and data distributions. Retraining a new model to adapt to new scenarios or settings incurs expensive computation and time costs, which raises the demand for a versatile and universal segmentation model that can cater to various granularities. Although some attempts have been made to unify different segmentation tasks or to generalize to various scenarios, limitations in the definition of paradigms and input-output spaces make it difficult for them to accurately understand content at arbitrary granularity. To this end, we present UniLSeg, a universal segmentation model that can perform segmentation at any semantic level under the guidance of language instructions. To train UniLSeg, we reorganize a group of tasks from their original diverse distributions into a unified data format, where images with texts describing the segmentation targets are the input and the corresponding masks are the output. Combined with an automatic annotation engine that exploits numerous unlabeled data, UniLSeg achieves excellent performance on various tasks and settings, surpassing both specialist and other unified segmentation models.


📖 Pipeline

We have open-sourced the general inference code and the UniLSeg-20 model weights (without fine-tuning on task-specific datasets). If you find any bugs caused by carelessness on our part in organizing the code, feel free to contact us and point them out!

Installation

Install the required packages:

conda create -n UniLSeg python=3.7
conda activate UniLSeg
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge -y
pip install -r requirements.txt
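
As a quick sanity check (our suggestion, not part of the official instructions), you can verify that the expected PyTorch build is installed and that CUDA is visible:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Expected output on a machine with a CUDA 11.3-compatible driver: 1.10.1 True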

Usage

  • Pretrained Weights

    We have provided the pretrained UniLSeg-20 model weights (without fine-tuning on task-specific datasets) as well as other pre-trained backbone weights. Please download them from here and put them under the current path.

General Inference

You can run general inference with the following command:

python general_inference.py  --img <IMG_PATH> --exp <'EXPRESSION'> --sp <MASK_SAVE_PATH>
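
For example, a hypothetical invocation (the image path, expression, and save path below are placeholders for illustration, not files shipped with this repo) could look like:

python general_inference.py --img ./examples/dog.jpg --exp 'the dog on the left' --sp ./outputs/dog_mask.png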

Cite

If you find our work helpful, we'd appreciate it if you could cite our paper in your work.

@article{liu2023universal,
  title={Universal Segmentation at Arbitrary Granularity with Language Instruction},
  author={Liu, Yong and Zhang, Cairong and Wang, Yitong and Wang, Jiahao and Yang, Yujiu and Tang, Yansong},
  journal={arXiv preprint arXiv:2312.01623},
  year={2023}
}


unilseg's Issues

I get poor results

Is there anything I need to adjust?
Why are the results I get so different from yours?
(result screenshots attached)

Code question

  if not self.cfg.aux_loss:
      pred = torch.bmm(query_output, pixel_output.flatten(2)) 
      pred = rearrange(pred, 'b l (h w) -> b l h w', h=h, w=w)   
  else:
      for l, q in enumerate(query_output):
          final_output = []
          pred = torch.bmm(query_output[l], pixel_output.flatten(2))
          pred = rearrange(pred, 'b l (h w) -> b l h w', h=h, w=w)
          final_output.append(pred)
  return pred.detach()

Why is only the final pred returned here? What is the purpose of the other five preds?
Could you please resolve my confusion? Thank you very much!
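
For readers hitting the same question: a minimal sketch of how per-layer predictions are usually accumulated for auxiliary supervision (our guess at the intended logic, not the authors' confirmed code) would move final_output outside the loop so that every layer's prediction is kept:

import torch
from einops import rearrange

def decode_preds(query_output, pixel_output, h, w, aux_loss=True):
    # query_output: per-layer query embeddings, each of shape (B, L, C)
    # pixel_output: visual features of shape (B, C, H, W)
    if not aux_loss:
        pred = torch.bmm(query_output[-1], pixel_output.flatten(2))  # (B, L, H*W)
        return rearrange(pred, 'b l (h w) -> b l h w', h=h, w=w)
    final_output = []  # accumulated across ALL layers, not reset per iteration
    for q in query_output:
        pred = torch.bmm(q, pixel_output.flatten(2))
        pred = rearrange(pred, 'b l (h w) -> b l h w', h=h, w=w)
        final_output.append(pred)
    return final_output  # each entry can receive an auxiliary loss during training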

AP for PartSeg

Hi @workforai et al.,

thanks for your CVPR'24 work. For part segmentation, may I ask if the conventional AP metric (apart from the IoU reported in the paper) could be reported as well? Looking forward to the code & checkpoint. Thanks & best,

Segmenting all objects in an image

I noticed that the bpe_simple_vocab_16e6.txt.gz file does not contain prompts such as "object" or "all objects". If I want to segment all objects in an image, what prompt should I use?

Evaluation on Semantic and Open-Vocabulary Segmentation

Thank you for your outstanding work! The model seems to be more tailored towards Referring Image Segmentation, and I'm still somewhat confused about testing for Semantic Segmentation (SS) and Open-Vocabulary Segmentation (OVS). Although the paper mentions that "Semantic segmentation and open-vocabulary segmentation can be reformulated as language-guided paradigm by replacing output layers with computing the similarity between visual and linguistic embeddings," the process still appears unclear to me.

From what I understand, the model seems to output a mask by calculating the similarity between the activated visual features and content-aware linguistic embedding. However, I'm unsure how this is evaluated in SS or OVS. Here's my guess:

For example, in Open-Vocabulary Segmentation, for a given image, we need to identify which categories are present (say, M categories). Then, for each category, the similarity calculation is performed between the activated visual features and content-aware linguistic embedding, ultimately outputting M masks. These masks are then merged to create the final semantic segmentation map.

Could you please confirm if this understanding is correct? If not, could you provide more details on how the model operates for these tasks?

Thank you for your assistance!
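
To make the guess above concrete, here is a minimal sketch of the merging step it describes (the function name and threshold are our assumptions for illustration, not the repo's API): run the model once per category prompt, stack the M mask score maps, and take a pixel-wise argmax to form the final semantic map.

import torch

def merge_category_masks(mask_scores, score_threshold=0.5):
    # mask_scores: (M, H, W) tensor, one sigmoid score map per category prompt
    scores, labels = mask_scores.max(dim=0)      # best-scoring category per pixel
    semantic_map = labels.clone()
    semantic_map[scores < score_threshold] = -1  # -1 marks background / no match
    return semantic_map

# Hypothetical usage: masks[i] = model(image, text=category_names[i])
# semantic_map = merge_category_masks(torch.stack(masks))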
