Masked Vision-Language Transformer in Fashion

  • Authors: Ge-Peng Ji^, Mingchen Zhuge^, Dehong Gao, Deng-Ping Fan#, Christos Sakaridis, and Luc Van Gool
  • Accepted by Machine Intelligence Research 2023
  • Link: arXiv Paper
  • This project is still a work in progress, and we invite everyone to contribute to making it more accessible and useful. If you have any questions, please feel free to drop us an e-mail ([email protected] & [email protected] & [email protected]), report them directly in an issue, or open a PR.
  • Your star is our motivation. Let's enjoy it!

Dataset Preparation

This project conducts several experiments on the public Fashion-Gen dataset, which contains 260,480 text-image pairs for training and 35,528 text-image pairs for inference. The MVLT model directly processes the raw image and text without any feature-engineering pre-processing. However, the data needs to be reorganized into a storage layout that the PyTorch dataloader can consume:

Please download the reorganized dataset from OneDrive.
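The exact layout depends on the reorganized OneDrive package. Purely as an illustrative sketch (the split file names and annotation fields below are assumptions, not the repository's actual format), a PyTorch dataset for such text-image pairs could look like this:

    # Illustrative sketch of a text-image pair dataset for Fashion-Gen.
    # The directory layout, file names, and annotation fields are assumptions;
    # adapt them to the reorganized package downloaded from OneDrive.
    import json
    from pathlib import Path

    from PIL import Image
    from torch.utils.data import Dataset

    class FashionGenPairs(Dataset):
        """Text-image pairs from the reorganized Fashion-Gen package."""

        def __init__(self, root, split="train", transform=None):
            self.root = Path(root)
            # Assumed layout: one JSON file per split listing {"image": ..., "text": ...} records.
            with open(self.root / f"{split}.json") as f:
                self.records = json.load(f)
            self.transform = transform

        def __len__(self):
            return len(self.records)

        def __getitem__(self, idx):
            rec = self.records[idx]
            image = Image.open(self.root / rec["image"]).convert("RGB")
            if self.transform is not None:
                image = self.transform(image)
            return image, rec["text"]  # image (tensor after transform) and raw product description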

Preliminaries

Install the basic libraries (Python 3.6, PyTorch 1.8, CUDA 10.1) on Ubuntu 18.04. We have not validated other library versions or operating systems, but we expect the code to be easy to adapt with minor changes.

  • Create the environment via conda create -n MVLT python=3.6
  • Install PyTorch via ~/miniconda3/envs/MVLT/bin/python3.6 -m pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
  • Install the auxiliary libraries via ~/miniconda3/envs/MVLT/bin/python3.6 -m pip install -r requirements.txt
  • The PVT-Tiny checkpoint for pre-training is provided at ./preweights/pvt_v1/pvt_tiny.pth. You can also try other PVT-v1 and PVT-v2 variants (download link) to further boost performance if you have enough GPU memory.
  • Download the checkpoints from OneDrive and move them into ./checkpoints/. Note that this tar.gz file contains two weights: checkpoint_retrieval.pth and checkpoint_recognition.pth. A quick sanity check is sketched after this list.
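After these steps, a quick sanity check can confirm that PyTorch sees the GPU and that the downloaded weights load correctly. This is only a sketch; the checkpoint paths follow the layout described above:

    # Quick sanity check for the environment and the prepared checkpoints.
    # Paths follow the layout described above; adjust them if you stored the files elsewhere.
    import os
    import torch

    print(torch.__version__, torch.version.cuda, torch.cuda.is_available())

    for path in ["./preweights/pvt_v1/pvt_tiny.pth",
                 "./checkpoints/checkpoint_retrieval.pth",
                 "./checkpoints/checkpoint_recognition.pth"]:
        if not os.path.exists(path):
            print(path, "is missing")
            continue
        state = torch.load(path, map_location="cpu")  # inspect on CPU only
        weights = state["model"] if isinstance(state, dict) and "model" in state else state
        print(f"{path}: {len(weights)} entries")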

Training

Please note that we only use PVT-Tiny to learn multi-modal features; a stronger backbone, such as Swin Transformer or PVTv2, would likely further improve the representation ability.

  • Set your data path (the --data-path parameter) in ./scripts_dws/dws_mvlt_exp21.sh or ./scripts_dws/dws_mvlt_ft_exp48.sh (see the sketch after this list for how such an argument is typically consumed)
  • Run bash ./scripts_dws/dws_mvlt_exp21.sh for pre-training
  • Run bash ./scripts_dws/dws_mvlt_ft_exp48.sh for fine-tuning
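For reference, the shell scripts ultimately pass --data-path to a Python training entry point. The sketch below is hypothetical: only --data-path is taken from this README, while the entry-point structure, the other arguments, and the FashionGenPairs class (from the dataset sketch above) are assumptions used to illustrate how the argument typically feeds the dataloader:

    # Hypothetical sketch: how a --data-path argument typically feeds the dataloader.
    # Only --data-path comes from this README; everything else is illustrative.
    import argparse

    from torch.utils.data import DataLoader
    from torchvision import transforms

    def main():
        parser = argparse.ArgumentParser(description="MVLT pre-training (sketch)")
        parser.add_argument("--data-path", required=True,
                            help="root of the reorganized Fashion-Gen data")
        parser.add_argument("--batch-size", type=int, default=64)
        args = parser.parse_args()

        transform = transforms.Compose([transforms.Resize((256, 256)), transforms.ToTensor()])
        # FashionGenPairs is the illustrative dataset class from the "Dataset Preparation" section.
        dataset = FashionGenPairs(args.data_path, split="train", transform=transform)
        loader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True, num_workers=4)

        for images, texts in loader:
            pass  # the masked vision-language pre-training step would go here

    if __name__ == "__main__":
        main()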

Inference

  • Downstream retrieval tasks

    • We report the zero-shot retrieval performance without any fine-tuning, so the well-trained weights can be used directly for the retrieval tasks.
    • Run bash downstream_retrieval.sh to get the prediction results for Image-Text Retrieval (ITR) and Text-Image Retrieval (TIR):
      • Text-Image Retrieval (TIR): acc@1: 0.346, acc@5: 0.780, acc@10: 0.895
      • Image-Text Retrieval (ITR): acc@1: 0.331, acc@5: 0.772, acc@10: 0.911
  • Downstream recognition tasks

    • This task requires fine-tuning because the pre-trained model is not equipped with a classification head.
    • Run bash downstream_recognition.sh to get the prediction results for Main-Category Recognition (M-CR) and Sub-Category Recognition (S-CR); how these metrics are typically computed is sketched after this list:
      • Main-Category Recognition (M-CR): accuracy: 0.9826, macro-F1: 0.8955
      • Sub-Category Recognition (S-CR): accuracy: 0.9357, macro-F1: 0.8286
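The metrics above follow standard definitions. The minimal sketch below assumes a pre-computed text-image similarity matrix for retrieval and predicted category labels for recognition, and uses scikit-learn for the classification metrics; the repository's own evaluation code may differ in detail:

    # Minimal sketch of the reported metrics: acc@k for retrieval,
    # accuracy and macro-F1 for category recognition.
    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score

    def acc_at_k(sim, ks=(1, 5, 10)):
        """sim[i, j] is the similarity between query i and candidate j;
        the matching candidate of query i is assumed to sit at index i."""
        order = np.argsort(-sim, axis=1)              # candidates sorted by decreasing similarity
        gt = np.arange(sim.shape[0])[:, None]         # ground-truth candidate index of each query
        return {f"acc@{k}": float((order[:, :k] == gt).any(axis=1).mean()) for k in ks}

    # Retrieval: rows are text queries for TIR; transpose the matrix for ITR.
    sim = np.random.rand(100, 100)                    # placeholder similarity matrix
    print(acc_at_k(sim))

    # Recognition (M-CR / S-CR): placeholder labels, class count is illustrative.
    y_true = np.random.randint(0, 10, size=1000)
    y_pred = np.random.randint(0, 10, size=1000)
    print(accuracy_score(y_true, y_pred), f1_score(y_true, y_pred, average="macro"))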

Citation

@article{ji2022masked,
  title={Masked Vision-Language Transformer in Fashion},
  author={Ji, Ge-Peng and Zhuge, Mingchen and Gao, Dehong and Fan, Deng-Ping and Sakaridis, Christos and Van Gool, Luc},
  journal={Machine Intelligence Research},
  year={2022}
}

Here are two concurrent works from the Alibaba ICBU Team.

@inproceedings{zhuge2021kaleido,
  title={Kaleido-bert: Vision-language pre-training on fashion domain},
  author={Zhuge, Mingchen and Gao, Dehong and Fan, Deng-Ping and Jin, Linbo and Chen, Ben and Zhou, Haoming and Qiu, Minghui and Shao, Ling},
  booktitle={CVPR},
  pages={12647--12657},
  year={2021}
}

@inproceedings{10.1145/3397271.3401430,
  author = {Gao, Dehong and Jin, Linbo and Chen, Ben and Qiu, Minghui and Li, Peng and Wei, Yi and Hu, Yi and Wang, Hao},
  title = {FashionBERT: Text and Image Matching with Adaptive Loss for Cross-Modal Retrieval},
  year = {2020},
  publisher = {Association for Computing Machinery},
  booktitle = {Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages = {2251--2260},
  numpages = {10},
  location = {Virtual Event, China},
  series = {SIGIR '20}
}

Acknowledgement

Thanks to the Alibaba ICBU Search Team and Wenhai Wang (PVT) for their technical support.
