deep-learning-image-caption-generator's Introduction

Deep Learning Image Caption Generator

Deep CNN-LSTM for Generating Image Descriptions 😈

Keywords: image captioning, image description generator, explain image, merge model, deep learning, long short-term memory, recurrent neural network, convolutional neural network, word by word, word embedding, BLEU score.

Related work: deep models for computer vision and natural language, image-sentence retrieval, generating novel sentence descriptions for images.

Abstract

Image captioning is an interesting problem in machine learning. With the development of deep neural networks, deep learning has become the state-of-the-art approach to it. The goal of image captioning is to automatically generate a description of an image, which requires understanding the image's content. Earlier end-to-end models such as GoogleNIC (Show and Tell), MontrealNIC (Show, Attend and Tell), LRCN, and mRNN are called inject models: their idea is to feed the image features through the RNN. In 2017, Marc Tanti et al. introduced the merge model in their paper What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?. The main idea of this model is to keep the CNN and the RNN separate and merge their outputs only at the end, where a softmax layer makes the prediction. Based on it, we develop our model to generate image captions.
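To make the merge idea concrete, here is a minimal sketch in plain Python. It is not the model used in this repository. It assumes the image encoder and the text encoder have already produced fixed-length vectors; the names `softmax` and `merge_and_predict`, the toy weights, and the choice of concatenation as the merge operation are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_and_predict(image_vec, text_vec, weights, bias):
    """Merge the two encoder outputs, then score every vocabulary word.

    image_vec: output of the CNN branch (hypothetical values)
    text_vec:  output of the RNN branch (hypothetical values)
    weights:   one row per vocabulary word, one column per merged feature
    bias:      one value per vocabulary word
    """
    # List concatenation: the two branches only meet here, after encoding.
    merged = image_vec + text_vec
    scores = [
        sum(w * x for w, x in zip(row, merged)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(scores)


# Toy example: 2 image features, 2 text features, vocabulary of 3 words.
image_vec = [0.5, -1.0]
text_vec = [1.0, 0.25]
weights = [
    [0.1, 0.2, 0.3, 0.4],
    [0.0, -0.5, 0.5, 0.0],
    [1.0, 0.0, -1.0, 0.0],
]
bias = [0.0, 0.1, -0.1]
probs = merge_and_predict(image_vec, text_vec, weights, bias)
```

In an inject model, by contrast, the image vector would be passed into the RNN itself rather than joined with the RNN's output afterwards.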

I. Main Idea:

  • Combine ConvNet with LSTM
  • Deep ConvNet as image encoder
  • Language LSTM as text encoder
  • Fully connected layer as decoder
  • End-to-end model I -> S
  • Maximize P(S|I)
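Written out, the objective in the last two bullets is commonly expressed as a log-likelihood over the words of the caption. The notation below (w_t for the t-th word, N for the caption length, theta for the model parameters) is an assumption that follows common usage in the captioning literature; it is not taken from this repository.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log P(S \mid I; \theta),
\qquad
\log P(S \mid I; \theta) = \sum_{t=1}^{N} \log P(w_t \mid I, w_1, \dots, w_{t-1}; \theta)
```

The second equation is the chain rule applied to the caption, which is what allows a model to generate the sentence word by word.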

II. Dataset:

Flickr8k, split into train/validation/test sets at a 6:1:1 ratio.

The definitive description of the dataset is in the paper “Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics” from 2013.

The authors describe the dataset as follows:

"We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events … The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations."

Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, 2013.
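A 6:1:1 split of 8,000 images gives 6,000 for training and 1,000 each for validation and testing. The sketch below shows one way to produce such a split from a list of image identifiers. The function name `split_ids`, the fixed seed, the random shuffle, and the made-up file names are assumptions; the source does not say how the split in this project was actually drawn.

```python
import random


def split_ids(image_ids, ratios=(6, 1, 1), seed=0):
    """Shuffle image ids reproducibly and cut them into train/val/test lists."""
    ids = sorted(image_ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical identifiers standing in for the real Flickr8k file names.
image_ids = [f"img_{i:04d}.jpg" for i in range(8000)]
train_ids, val_ids, test_ids = split_ids(image_ids)
```

Splitting by image rather than by caption keeps all five captions of an image in the same subset.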

III. Baseline Model:

IV. Implementation:

  1. Load images and extract features: kaggle-kernel
  2. Load text data: kaggle-kernel
  3. Develop and train the model: kaggle-kernel
  4. Evaluate the model: kaggle-kernel
  5. Generate captions for new images: kaggle-kernel
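Step 5 can be pictured as a greedy, word-by-word loop: ask the model for the next-word distribution, keep the most likely word, and stop at an end marker or a length limit. The sketch below is not the code from the kernels. The `next_word_probs` callable, the `startseq` and `endseq` markers, and the toy model are hypothetical stand-ins for a trained network.

```python
def generate_caption(next_word_probs, photo_features, max_length,
                     start="startseq", end="endseq"):
    """Greedy decoding: always append the single most probable next word."""
    words = [start]
    for _ in range(max_length):
        # next_word_probs returns a dict mapping each word to its probability.
        probs = next_word_probs(photo_features, words)
        best = max(probs, key=probs.get)
        if best == end:
            break
        words.append(best)
    return " ".join(words[1:])  # drop the start marker


def toy_model(photo_features, words):
    """Hypothetical stand-in for a trained model; it ignores the photo."""
    script = ["dog", "runs", "endseq"]
    target = script[min(len(words) - 1, len(script) - 1)]
    return {w: (0.9 if w == target else 0.05) for w in script}


caption = generate_caption(toy_model, photo_features=None, max_length=10)
```

Greedy decoding is the simplest choice; beam search is a common alternative that keeps several candidate captions at each step.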

V. Tuning hyperparameters:

Encoder ConvNet:

  • VGG16
  • ResNet50
  • DenseNet121
  • InceptionV3

Optimizer

  • Adam:
  • Nadam:
  • RMSprop:
  • SGD:
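The two lists above define a small search grid of 4 encoders by 4 optimizers. A sketch of how such a grid could be enumerated and compared is shown here; `build_grid`, `pick_best`, and the scoring function are hypothetical, and in practice the score would come from training each configuration and measuring it on the validation set.

```python
from itertools import product

ENCODERS = ["VGG16", "ResNet50", "DenseNet121", "InceptionV3"]
OPTIMIZERS = ["Adam", "Nadam", "RMSprop", "SGD"]


def build_grid(encoders, optimizers):
    """Return every encoder/optimizer combination as a config dict."""
    return [
        {"encoder": enc, "optimizer": opt}
        for enc, opt in product(encoders, optimizers)
    ]


def pick_best(grid, score_fn):
    """Return the config with the highest score.

    score_fn is a hypothetical callable that trains a model for the
    given config and returns a validation score (higher is better).
    """
    return max(grid, key=score_fn)


grid = build_grid(ENCODERS, OPTIMIZERS)
```

Each optimizer would normally also carry its own settings, such as the learning rate; those values are not given in the source, so they are left out here.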

VI. Evaluation and result:

We use the BLEU score as the evaluation metric:

  • BLEU-1: 0.542805
  • BLEU-2: 0.301714
  • BLEU-3: 0.207351
  • BLEU-4: 0.095704
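For readers unfamiliar with BLEU, the sketch below computes a sentence-level score in plain Python: clipped n-gram precision for n = 1..N, combined by a geometric mean and multiplied by a brevity penalty. It is an illustration of the metric, not the evaluation code behind the figures above; the source does not say which implementation was used, whether the scores are cumulative or individual, or whether smoothing was applied. The function names and the example sentences are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against its references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Each candidate n-gram is credited at most as often as it appears
    # in the single reference where it is most frequent.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n):
    """Cumulative BLEU with uniform weights over 1..max_n, no smoothing."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Made-up example sentences, tokenized by whitespace.
candidate = "a dog runs on the grass".split()
references = ["a dog runs across the grass".split(),
              "the dog is running on grass".split()]
scores = {n: bleu(candidate, references, n) for n in (1, 2, 3, 4)}
```

Corpus-level BLEU, which is what is usually reported, pools the n-gram counts over all test captions before taking the ratios instead of averaging per-sentence scores.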

Caption of new images:

Report coming soon!

Reference

  • [1] Marc Tanti, Albert Gatt. Where to put the Image in an Image Caption Generator. arXiv preprint arXiv:1703.09137, 2018.
  • [2] Marc Tanti, Albert Gatt, Kenneth P. Camilleri. What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? arXiv preprint arXiv:1708.02043, 2017.
  • [3] Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv preprint arXiv:1502.03044, 2016.
  • [4] Andrej Karpathy, Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. arXiv preprint arXiv:1412.2306, 2015.
  • [5] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.
  • [6] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, Serge Belongie. Learning to Evaluate Image Captioning.
  • [7] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Alan L. Yuille. Explain Images with Multimodal Recurrent Neural Networks. arXiv preprint arXiv:1410.1090, 2014.
  • [8] Xinlei Chen, C. Lawrence Zitnick. Learning a Recurrent Visual Representation for Image Caption Generation. arXiv preprint arXiv:1411.5654, 2016.

Happy training 🎉 and please vote ⭐ if it helps!

deep-learning-image-caption-generator's People

Contributors

damminhtien

deep-learning-image-caption-generator's Issues

Evaluation

Evaluate the model's performance!

Parameter tuning

Try different parameters and choose a suitable model!

Kaggle kernel for loading images and extracting features is not correct

I believe the Kaggle kernel link you provided for loading images and extracting features is not correct. The link leads to code that defines the data generator slightly differently from the code in your model-development Kaggle kernel. That said, the notebook referenced for handling image data does show how the photo features get saved to features.pkl, so any researcher or learner should be able to find what they need. I am only pointing this out in case it was not intentional; you may want to check the Kaggle kernel link for loading photos and extracting features.
