
image_captioning_with_transformers's Introduction

Mohamed Zarzoura


🔥 About Me:

  • I am interested in NLP, NLU, and multimodal learning based on language processing.
  • A final-year master's student in Language Technology at Gothenburg University.

💼 Projects:

  • 🔗 Image captioning using a transformer-based model, based on arXiv:2101.10804.
  • 🔗 Relation extraction using LSTMs on sequences and tree structures, based on 10.18653/v1/P16-1105.
  • 🔗 Embedding textual spatial language (ongoing), based on the Google paper arXiv:1807.01670.
  • 🔗 A bot that plays the word game Ghost with a user, implemented using Rasa.
  • 🔗 A bot assistant that helps a user interface with Zotero: the user can add papers and query the items in their database. The bot was implemented using a proprietary tool called TDM.

๐Ÿ› ๏ธ Languages and Tools:

  • Programming
  • ML/DL
  • NLP
  • NLU


image_captioning_with_transformers's People

Contributors

felipezeiser, zarzouram


image_captioning_with_transformers's Issues

Confusion about "vector_dir" when preparing the dataset

Thanks for your great work! I'm trying to reproduce your results but got stuck running create_dataset.py. I have read through the instructions on preparing the datasets, but I'm confused about setting "vector_dir", the directory containing the pre-trained embedding vector files. Where can I get these files or zips? Have I missed an important part of the instructions?
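For context: the ".{vector_dim}d" naming in the traceback of the next issue matches GloVe's file naming (e.g. glove.6B.zip, which unpacks to glove.6B.300d.txt), so the script appears to expect a GloVe-style archive inside vector_dir. Below is a minimal sketch of fetching one; the URL and file layout are assumptions, not the repo's documented setup:

    # Hedged sketch: place a GloVe-style zip in vector_dir.
    # The URL and names are assumptions inferred from the ".{dim}d"
    # pattern seen in create_dataset.py's traceback, not the repo's docs.
    import urllib.request
    from pathlib import Path

    vector_dir = Path("vector_dir")
    vector_dir.mkdir(parents=True, exist_ok=True)
    url = "https://nlp.stanford.edu/data/glove.6B.zip"  # assumed source
    urllib.request.urlretrieve(url, vector_dir / "glove.6B.zip")
    # create_dataset.py would then derive a name like "glove.6B.300d".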

Error running the train command

Hi @zarzouram, I have some questions about your work.

I have a question about running create_dataset.py. I'm using the following command:

python code/create_dataset.py --dataset_dir dataset_dir --json_train json_train --json_val json_val --image_train image_train --image_val image_val --output_dir output_dir --vector_dir vector_dir --vector_dim 300 --min_freq 5 --max_len 52

However, when I execute this command, I encounter the following error:

Traceback (most recent call last):
  File "code/create_dataset.py", line 56, in <module>
    vector_name = f"{vector_name[0].name.strip('.zip')}.{args.vector_dim}d"
IndexError: list index out of range

  1. Also, where can I obtain the validation annotations, since there is only one JSON file?

Would you be so kind as to assist me in resolving this matter?
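For anyone hitting the same IndexError: the failing line takes the first element of a (presumably empty) list of matches found in vector_dir, so the error means no embedding archive is present there. A hedged reconstruction of what line 56 appears to do, with a friendlier guard; the glob pattern is an assumption:

    # Hedged reconstruction of the failing line in create_dataset.py;
    # the glob pattern is an assumption inferred from the traceback.
    from pathlib import Path

    vector_dim = 300
    vector_dir = Path("vector_dir")
    matches = sorted(vector_dir.glob("*.zip"))  # pre-trained vector zips
    if not matches:  # the empty case that raises IndexError upstream
        raise FileNotFoundError(
            f"no embedding .zip found in {vector_dir}; download one first")
    vector_name = f"{matches[0].stem}.{vector_dim}d"  # e.g. "glove.6B.300d"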

How can you measure the mean and standard deviation values?

Thanks for sharing the code. Excuse me, I need to measure the mean and standard deviation for the model under different weight initializations. Does that mean I need to train the model with different learning rates? Or should I train the model several times (for example, three), since each run gives a slightly different accuracy, evaluate the model after each run to get BLEU scores, and then average the scores?
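The usual practice is the latter: keep the learning rate and all other hyperparameters fixed, vary only the random seed (and hence the weight initialization) across runs, then report the mean and standard deviation of the metric. A minimal sketch; the BLEU-4 values below are hypothetical placeholders, not results from this repo:

    # Hedged sketch: aggregate BLEU-4 over runs that differ only in the
    # random seed. The scores are hypothetical placeholders.
    import statistics

    bleu4_scores = [0.291, 0.285, 0.288]  # one score per training run
    mean = statistics.mean(bleu4_scores)
    std = statistics.stdev(bleu4_scores)  # sample standard deviation
    print(f"BLEU-4: {mean:.3f} +/- {std:.3f}")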

I need help with the transformer's output, attns.

Hello. I am so grateful that you wrote this implementation code.

I want to use this code to train the model on the COCO dataset, but I ran into a problem, so I am leaving a question.
While running the "run_train.py" code, an error occurred, as shown in the screenshot below.

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2022-09-16 แ„‹แ…ฉแ„’แ…ฎ 9 00 49

At this point, attns should have the shape [layer_num, head_num, batch_size, max_len, code_size^2], but in my result the head_num dimension has disappeared, leaving [layer_num, batch_size, max_len, code_size^2].

To investigate, I looked at "models/IC_encoder_decoder/transformer.py"; the problem seems to occur when attns is assembled by the layer modules. I'm leaving this question because I was wondering whether there is a way to solve it.

I'll be waiting for your reply.

  • Also, because of my limited English, the tone of this question may come across as unpleasant. I ask for your generous understanding.
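A note on the shape question: if each decoder layer returns per-head attention of shape [head_num, batch_size, max_len, code_size^2], stacking the layers preserves the head dimension, while averaging (or summing) over heads inside a layer drops it and yields exactly the 4-D shape reported above. A hedged sketch; the names and sizes are illustrative, not this repo's actual code:

    # Hedged sketch: why the head dimension can disappear from attns.
    # All names and sizes here are illustrative assumptions.
    import torch

    layers, heads, batch, max_len, pixels = 2, 8, 4, 52, 196
    per_layer = [torch.rand(heads, batch, max_len, pixels)
                 for _ in range(layers)]

    kept = torch.stack(per_layer)        # [layers, heads, batch, len, pix]
    dropped = torch.stack([a.mean(dim=0)  # averaging over heads loses them
                           for a in per_layer])  # [layers, batch, len, pix]
    print(kept.shape, dropped.shape)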

Questions regarding the use of custom datasets

Hi,

I'm trying to use a dataset that has only one caption per image, but I'm having trouble finding where you handle the captions for training. How can I pass a different dataset to your code?
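A general workaround for single-caption datasets: COCO-style pipelines often assume a fixed number of captions per image (five in COCO), so one option is to repeat the lone caption up to the expected count before building the dataset. A hedged sketch assuming a simple image-to-captions mapping, not this repo's exact JSON format:

    # Hedged sketch: pad a single caption per image up to the count a
    # COCO-style pipeline expects. The data layout is an assumption.
    CAPTIONS_PER_IMAGE = 5  # assumed; COCO convention

    dataset = {"img_001.jpg": ["a dog runs on the beach"]}
    for image, captions in dataset.items():
        # Repeat the lone caption so downstream code sees a fixed count.
        captions.extend([captions[0]] * (CAPTIONS_PER_IMAGE - len(captions)))

    print(dataset["img_001.jpg"])  # five identical captions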
