This repository is a pure PyTorch implementation of *An Image is Worth 16x16 Words*, a paper published in 2020 that applied the highly successful transformer models from natural language processing to computer vision tasks.
- Getting Started
- Usage
- Module details
The ViT model requires an installation of PyTorch to run.
To clone this repository locally, run the following command:

```shell
git clone https://github.com/p4arth/Replicating-ViT.git
```
Import the ViT (Vision Transformer) module, which lives under `modules.vit`:

```python
from modules.vit import ViT

# Initializing the model
model = ViT()
```
The `modules` folder contains five submodules that together form the Vision Transformer model.
- Patch Embeddings

This module contains the patch embeddings class, which is used, as in the paper, to turn an image into patches of size 16x16. The patch embeddings are then flattened and passed to the transformer encoder block.
- Multi-Headed Self Attention (MSA)

This module contains the multi-headed self-attention block that resides inside the transformer encoder. The block applies a series of attention heads to its input.
- Multi-Layer Perceptron (MLP)

This module follows the multi-headed self-attention block and contains a multi-layer perceptron, built from dense (fully connected) layers.
- Transformer Encoder

This module combines the MSA and MLP blocks to form the transformer encoder layer. The input to this layer is the flattened patches of an image, which go through a series of transformations in the MSA and MLP blocks.
- ViT

This module implements the complete Vision Transformer model.
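The patch-embedding step described above can be sketched in plain PyTorch. This is an illustrative sketch, not the repository's actual code: the `PatchEmbedding` name is assumed, the defaults (16x16 patches, 768-dim embeddings) come from the paper, and a strided `Conv2d` stands in for explicitly cutting and projecting patches.

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Splits an image into 16x16 patches and linearly embeds each one.

    A Conv2d whose kernel size equals its stride is equivalent to cutting
    the image into non-overlapping patches and applying a shared linear
    projection to each flattened patch.
    """
    def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)          # (batch, embed_dim, H/16, W/16)
        x = x.flatten(2)          # (batch, embed_dim, num_patches)
        return x.transpose(1, 2)  # (batch, num_patches, embed_dim)

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768]) -- 14 x 14 patches
```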
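The MSA block can likewise be sketched with PyTorch's built-in `nn.MultiheadAttention`; the repository's own module may construct the attention heads by hand. The pre-norm placement and the default of 12 heads are assumptions based on the paper.

```python
import torch
from torch import nn

class MSA(nn.Module):
    """Multi-headed self-attention with a LayerNorm applied first (pre-norm)."""
    def __init__(self, embed_dim=768, num_heads=12):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x):
        x = self.norm(x)
        # Self-attention: the same sequence serves as query, key, and value.
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

tokens = torch.randn(1, 197, 768)  # 196 patch tokens + 1 class token
out = MSA()(tokens)
print(out.shape)  # torch.Size([1, 197, 768])
```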
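A minimal sketch of the MLP block, under the paper's hyperparameters (hidden size 3072, GELU activation, dropout); the class name and defaults here are assumptions, not the repository's code.

```python
import torch
from torch import nn

class MLP(nn.Module):
    """Two dense layers with a GELU non-linearity, as used inside the encoder."""
    def __init__(self, embed_dim=768, hidden_dim=3072, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

out = MLP()(torch.randn(1, 197, 768))
print(out.shape)  # torch.Size([1, 197, 768])
```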
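How the encoder layer wires MSA and MLP together can be sketched as below. The residual connections and pre-norm layout follow the paper; the class name and defaults are illustrative assumptions.

```python
import torch
from torch import nn

class TransformerEncoderBlock(nn.Module):
    """One encoder layer: pre-norm MSA and MLP, each wrapped in a residual."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around MSA
        x = x + self.mlp(self.norm2(x))                    # residual around MLP
        return x

out = TransformerEncoderBlock()(torch.randn(1, 197, 768))
print(out.shape)  # torch.Size([1, 197, 768])
```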
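Putting the pieces together, the full model can be sketched end to end. This is a compressed stand-in, not the repository's `ViT`: it uses PyTorch's built-in `nn.TransformerEncoderLayer` in place of the custom encoder module, and the `ViTSketch` name, the shallow default depth, and the 1000-class head are assumptions for illustration.

```python
import torch
from torch import nn

class ViTSketch(nn.Module):
    """Patch embedding + class token + position embeddings + encoder stack + head."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 embed_dim=768, depth=2, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.encoder = nn.Sequential(*[
            nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=num_heads,
                dim_feedforward=embed_dim * 4,
                activation="gelu", batch_first=True, norm_first=True)
            for _ in range(depth)
        ])
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # prepend class token
        x = torch.cat([cls, x], dim=1) + self.pos_embed     # add positions
        x = self.encoder(x)
        return self.head(x[:, 0])  # classify from the class token only

logits = ViTSketch()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```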