LENS🔍

[Blog] [Demo] [Paper] [Colab]

This is the official repository for the LENS (Large Language Models Enhanced to See) paper.

Setup

  1. We recommend a machine with CUDA-capable GPUs. A single GPU, or even a CPU, works, but for large datasets you should use several GPUs.

  2. Create a Python 3.9 conda or virtual environment, then install this repo as a pip package:

    pip install llm-lens
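
To confirm which hardware the environment from step 1 will use, a small check like the one below can help. It is a sketch and not part of the package: `pick_device` is a hypothetical helper that relies only on PyTorch's `torch.cuda.is_available()` and falls back to CPU when PyTorch cannot be imported.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so nothing GPU-related can run yet.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```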

Usage

First, the system runs a series of highly descriptive vision modules on your own dataset of images. These modules produce a large set of natural language captions, tags, objects, and attributes for each image. The descriptions are then passed to a large language model (LLM) so that it can solve a variety of tasks relating to the image. Despite the simplicity of our system, and the fact that it requires no finetuning, we demonstrate in our paper that it often performs better than other state-of-the-art image-language models such as Flamingo, CLIP, and Kosmos.
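
The idea of turning vision-module outputs into text for an LLM can be sketched in a few lines. This illustrates the concept only and is not the library's implementation: `ImageDescription`, `build_prompt`, and the prompt layout are hypothetical, and the tags, attributes, and caption are made-up values standing in for what the vision modules would produce.

```python
from dataclasses import dataclass, field


@dataclass
class ImageDescription:
    """Hypothetical container for the text that vision modules produce for one image."""
    tags: list[str] = field(default_factory=list)
    attributes: list[str] = field(default_factory=list)
    captions: list[str] = field(default_factory=list)


def build_prompt(desc: ImageDescription, question: str) -> str:
    """Flatten the descriptions into a plain-text prompt (illustrative layout)."""
    lines = [
        "Tags: " + ", ".join(desc.tags),
        "Attributes: " + ", ".join(desc.attributes),
        "Captions: " + " ".join(desc.captions),
        "Question: " + question,
        "Short answer:",
    ]
    return "\n".join(lines)


# Made-up descriptions, for illustration only.
example = ImageDescription(
    tags=["mountain", "lake"],
    attributes=["snow-capped peak", "calm water"],
    captions=["A mountain reflected in a lake."],
)
prompt = build_prompt(example, "What is the image about?")
print(prompt)
```

Because the LLM only ever sees text, the same prompt can be sent to any language model without finetuning it on images.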

Visual Descriptions

To simply get visual descriptions from LENS to pass on to an LLM, use the following code:

import requests
import torch
from PIL import Image

from lens import Lens, LensProcessor

# Download an example image and convert it to RGB.
img_url = 'https://images.unsplash.com/photo-1465056836041-7f43ac27dcb5?w=720'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
question = "What is the image about?"

lens = Lens()
processor = LensProcessor()

# Inference only, so gradients are not needed.
with torch.no_grad():
    # The processor takes a list of images and a list of questions.
    samples = processor([raw_image], [question])
    output = lens(samples)
    prompts = output["prompts"]

The generated prompts can be passed to an LLM for solving a vision task. The output object also contains other useful information (tags, attributes, objects) that can be used to generate your custom prompts.
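
Field names may differ between versions, so one way to see what is available is to list the keys of the returned object. The helper below is a sketch that assumes the output behaves like a dictionary, as the `output["prompts"]` access suggests. `summarize_fields` is a hypothetical name, and the sample dictionary is made up so the snippet runs without loading any models.

```python
def summarize_fields(output):
    """Map each key of a dict-like output to a short description of its value."""
    summary = {}
    for key, value in output.items():
        kind = type(value).__name__
        try:
            # Sized containers (e.g. lists of per-image entries) report a length.
            size = len(value)
        except TypeError:
            size = None
        summary[key] = kind if size is None else f"{kind} (len={size})"
    return summary


# Made-up stand-in for the object returned by lens(samples).
fake_output = {
    "prompts": ["Tags: ... Question: What is the image about?"],
    "tags": [["mountain", "lake"]],
}
for name, description in summarize_fields(fake_output).items():
    print(f"{name}: {description}")
```

With the real output in hand, replace `fake_output` with `output` from the snippet above and build custom prompts from whichever fields it reports.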

Advanced Use Cases

  • Pass the image descriptions to a language model and ask it a question:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small", truncation_side='left', padding=True)
LLM_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

# `prompts` is the list produced by the Visual Descriptions snippet above.
input_ids = tokenizer(prompts, return_tensors="pt").input_ids
outputs = LLM_model.generate(input_ids)
# Drop padding and end-of-sequence tokens from the decoded answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

  • You can also augment a Hugging Face dataset with LENS descriptions:

from datasets import load_dataset
from lens import Lens, LensProcessor

lens = Lens()
processor = LensProcessor()
ds = load_dataset("llm-lens/lens_sample_test", split="test")
# Run LENS over the dataset to augment it with descriptions.
output_ds = lens.hf_dataset_transform(ds, processor, return_global_caption=False)

Architecture

LENS for open ended visual question answering

lens_open_ended

LENS for image classification

lens_close_ended

Coming Soon

We are working on bringing the complete LENS experience to you. The following scripts will be added soon:

  • Evaluation on VQA and other datasets present in the paper.
  • Generating vocabularies used in the paper.
  • Other scripts needed to reproduce the paper.

Citation

@misc{berrios2023language,
      title={Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language}, 
      author={William Berrios and Gautam Mittal and Tristan Thrush and Douwe Kiela and Amanpreet Singh},
      year={2023},
      eprint={2306.16410},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
