X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers (EMNLP 2020)

Summary

Recent multi-modal transformers have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and generative tasks like image captioning. This raises an interesting question: can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family, LXMERT, finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements. X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT.

Demo

Try out the AI2 Computer Vision Explorer demo!

Install

  • Python packages
conda create -n xlxmert python=3.7
conda activate xlxmert
cd x-lxmert
pip install -r ./requirements.txt

Code structure

# Store images, features, and annotations
./datasets
    COCO/
        images/
        features/
    VG/
        images/
        features/
    GQA/
        images/
        features/
    nlvr2/
        images/
        features/
    data/               <= Store text annotations (*.json) for each split
        lxmert/
        vqa/
        gqa/
        nlvr2/

# Run feature extraction and k-means clustering
./feature_extraction

# Train image generator
./image_generator
    snap/       <= Store image generator checkpoints
    scripts/    <= Bash scripts for training image generator

# Train X-LXMERT
./x-lxmert
    src/
        lxrt/           <= X-LXMERT model class implementation (inherits huggingface transformers' LXMERT class)
        pretrain/       <= X-LXMERT Pretraining
        tasks/          <= Fine-tuning on downstream tasks (VQA, GQA, NLVR2, Image generation)
    snap/       <= Store X-LXMERT checkpoints
    scripts/    <= Bash scripts for pretraining, fine-tuning, and image generation
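
Before running feature extraction or training, it can help to confirm that the ./datasets tree matches the layout above. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` and the `EXPECTED_DIRS` list are hypothetical and simply mirror the directory names listed in this section.

```python
from pathlib import Path

# Sub-directories of ./datasets listed in the layout above.
# This list is an illustrative assumption, not read from the repository.
EXPECTED_DIRS = [
    "COCO/images", "COCO/features",
    "VG/images", "VG/features",
    "GQA/images", "GQA/features",
    "nlvr2/images", "nlvr2/features",
    "data/lxmert", "data/vqa", "data/gqa", "data/nlvr2",
]


def missing_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Report which directories still need to be created or populated.
for d in missing_dirs("./datasets"):
    print(f"missing: ./datasets/{d}")
```

The check only looks at directory names; it does not verify that images, features, or annotation files are actually present.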

Feature extraction

Please check out ./feature_extraction to download pre-extracted features and for more details.

cd ./feature_extraction

# For Pretraining / VQA
python coco_extract_grid_feature.py --split train
python coco_extract_grid_feature.py --split valid
python coco_extract_grid_feature.py --split test

# For Pretraining
python VG_extract_grid_feature.py

# For GQA
python GQA_extract_grid_feature.py

# For NLVR2
python nlvr2_extract_grid_feature.py --split train
python nlvr2_extract_grid_feature.py --split valid
python nlvr2_extract_grid_feature.py --split test

# K-Means clustering
python run_kmeans.py --src mscoco_train --tgt mscoco_train mscoco_valid vg
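
The internals of run_kmeans.py are not described here. As a rough illustration of the clustering step, the sketch below assumes that --src names the split whose grid features are used to fit the centroids and that --tgt names the splits whose features are then assigned to their nearest centroid. The function names, toy feature shapes, and cluster count are illustrative assumptions, and plain Lloyd's k-means in NumPy stands in for whatever implementation the script actually uses.

```python
import numpy as np


def assign(features, centroids):
    """Return the index of the nearest centroid for every feature vector."""
    # Squared Euclidean distance between each feature and each centroid.
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def kmeans(features, n_clusters, n_iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, cluster ids of features)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen feature vectors.
    init = rng.choice(len(features), n_clusters, replace=False)
    centroids = features[init].copy()
    for _ in range(n_iters):
        ids = assign(features, centroids)
        for k in range(n_clusters):
            members = features[ids == k]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[k] = members.mean(axis=0)
    return centroids, assign(features, centroids)


# Toy stand-in for "source" grid features: 4 images, 8x8 grid, 16-dim vectors.
rng = np.random.default_rng(0)
src = rng.normal(size=(4 * 8 * 8, 16)).astype(np.float32)
centroids, src_ids = kmeans(src, n_clusters=10)

# Toy stand-in for a "target" image: map each grid cell to a cluster id.
tgt = rng.normal(size=(8 * 8, 16)).astype(np.float32)
tgt_ids = assign(tgt, centroids).reshape(8, 8)
print(tgt_ids.shape)
```

Each grid cell of a target image ends up with one integer cluster id, which is the kind of discrete visual label a clustering step like this produces.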

Pretraining

Pretrain on the LXMERT pretraining data

cd ./x-lxmert/
bash scripts/pretrain.bash

or download the pretrained checkpoint

wget -O x-lxmert/snap/pretrained/x_lxmert/Epoch20_LXRT.pth https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/x-lxmert/Epoch20_LXRT.pth
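
Note that `wget -O` does not create missing parent directories, so the command above fails if x-lxmert/snap/pretrained/x_lxmert/ does not exist yet. A minimal Python alternative, assuming only the standard library, is sketched below; the helper name `fetch` is hypothetical and not part of the repository.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch(url, dest):
    """Download url to dest, creating parent directories first."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():  # skip the download if the file is already there
        urlretrieve(url, str(dest))
    return dest


# Example call (commented out because it downloads a large file):
# fetch(
#     "https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/x-lxmert/Epoch20_LXRT.pth",
#     "x-lxmert/snap/pretrained/x_lxmert/Epoch20_LXRT.pth",
# )
```

The same helper works for the image generator checkpoint by passing its URL and destination path.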

Fine-tuning

VQA

cd ./x-lxmert/
bash scripts/finetune_vqa.bash
bash scripts/test_vqa.bash

GQA

cd ./x-lxmert/
bash scripts/finetune_gqa.bash
bash scripts/test_gqa.bash

NLVR2

cd ./x-lxmert/
bash scripts/finetune_nlvr2.bash
bash scripts/test_nlvr2.bash

Image generation

Train the image generator on MS COCO

cd ./image_generator/
bash scripts/train_generator.bash

or download the pretrained checkpoint

wget -O image_generator/snap/pretrained/G_60.pth https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/image_generator/G_60.pth

Sample images

cd ./x-lxmert/
bash scripts/sample_image.bash

Reference

@inproceedings{Cho2020XLXMERT,
  title={X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers},
  author={Cho, Jaemin and Lu, Jiasen and Schwenk, Dustin and Hajishirzi, Hannaneh and Kembhavi, Aniruddha},
  booktitle={EMNLP},
  year={2020}
}
