Image captioning is the process of generating a textual description of an image. It combines Natural Language Processing and Computer Vision to produce the captions. A captioning dataset takes the form [image → captions]: each input image is paired with its corresponding output captions. Commonly used datasets include:
- Common Objects in Context (COCO). A collection of more than 120 thousand images with descriptions.
- Flickr 8K. A collection of 8 thousand described images taken from flickr.com.
- Flickr 30K. A collection of 30 thousand described images taken from flickr.com.
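
To make the [image → captions] form concrete, here is a minimal sketch of grouping annotation lines into a mapping from each image to its list of captions. The tab-separated `image_id<TAB>caption` layout, the `group_captions` helper, and the sample file names and captions are all assumptions made for illustration; the actual annotation format differs between the datasets listed above.

```python
from collections import defaultdict


def group_captions(lines):
    """Group 'image_id<TAB>caption' lines into {image_id: [captions]}.

    The tab-separated layout is an assumed, simplified format,
    not the official layout of any particular dataset.
    """
    captions = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split only on the first tab so captions may contain tabs.
        image_id, caption = line.split("\t", 1)
        captions[image_id].append(caption)
    return dict(captions)


# Hypothetical annotation lines, invented for this example.
sample = [
    "dog.jpg\tA dog runs across a field.",
    "dog.jpg\tA brown dog playing outside.",
    "cat.jpg\tA cat sleeps on a sofa.",
]

pairs = group_captions(sample)
for image_id, image_captions in pairs.items():
    print(image_id, "->", image_captions)
```

Keeping a list of captions per image reflects that one image may have several reference descriptions, so a training pipeline can pick any of them as the target for that image.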