Giter Club home page Giter Club logo

clip_dual_encoder's Introduction

Visual and Vision-Language Representation Pre-Training with Contrastive Learning

In this repository, I present my approach as a family of vision-language foundation systems. These systems, which are considered among the most advanced in the field of artificial intelligence, are used to solve a variety of important tasks, such as generation, retrieval, and classification tasks. The basis of the vision-language systems is a combination of pre-trained encoder models of computer vision and natural language processing.

Setup

I used Python 3.7 and PyTorch 1.7.1 to train and test my models. It also requires the installation of timm and transformers packages with the following versions:

$ pip install timm==0.6.7
$ pip install transformers

Training the model:

python main.py

During this research, I trained our models on several image encoder models, such as deit3, efficientnet_b8, convnext, swinv2, and fbnetv3, as well as text encoder models, such as roberta, xlm-roberta, xlnet, albert, electra, and bert, on a small-scale dataset. I concluded that changing image and text encoder models would necessarily change the efficiency of the model. I was able to show that encoder models like ConvNeXt and RoBERTa produce better results than more popular encoder models like BERT and ViT. I also found that training our model on a small-scale data set produced accurate results.

Below in this table are the best results I got after testing more than 80 pairs of image and text encoders in 20 epochs.

Image and Text Encoders Top-1 accuracy Top-5 accuracy
Swin + BERT 13.99 31.65
Swin + ALBERT 13.28 29.04
Swin + RoBERTa 14.83 32.64
ConvNeXt + BERT 16.63 35.31
ConvNeXt + ALBERT 13.46 29.46
ConvNeXt + RoBERTa 17.31 35.31

The model composed by the ConvNeXt and RoBERTa encoders has achieved satisfactory results on image-text retrieval tasks on some datasets, such as the ImageNetV2 dataset, the Unsplash dataset, and more other datasets, using different queries (textual queries, visual queries, and visual + textual queries) as shown in these notebooks Search_In_Unsplash.ipynb and Image_To_Text_Search.ipynb.It also achieved these results on the video-text retrieval task for videos from YouTube Test_video.ipynb.This model has outperformed many other models in image classification tasks on some datasets, such as Food-101, CIFAR-10, CIFAR-100, Describable Textures (DTD), Oxford-IIIT Pets (Pets), MNIST, STL-10, the German Traffic Sign Recognition Benchmark (GTSRB), and Rendered SST2 (SST) CR_classification.ipynb.Also, I tested this model on zero-shot image classification tasks as it showed its efficiency in some datasets, such as the (CIFAR-100 classes with some random images), the (puppy and bagel), and the (chihuahua and muffin) datasets Zero-Shot Image Classification.ipynb.

I did this by applying the contrastive learning method to the image-text pair dataset, which was used to train my models. Then I used the Zero-Shot method to retrieve images or texts. My model achieved all of these results, considering the size of the Flickr30k dataset and the number of epochs used in training.

clip_dual_encoder's People

Contributors

saadkh1 avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.