Giter Club home page Giter Club logo

anything2image's Introduction

Anything To Image

PyPI

Generate image from anything with ImageBind's unified latent space and stable-diffusion-2-1-unclip.

TODO: Currently, we only support ImageBind-Huge with 1024 latent space. However, it might be possible to use StableDiffusionImageVariation for 768 latent space.

We need at least 22 Gb GPU memory for the demo. Therefore gradio and colab online demo might need pro account to obtain more GPU/memory to run them.

Support Tasks

Update

[2023/5/19]:

  • Anything2Image has been integrated into InternGPT.
  • [v1.1.4]: Support fusing audio and text in ImageBind latent space and UI improvements.

[2023/5/18]

  • [v1.1.3]: Support thermal to image.
  • [v1.1.0]: Gradio GUI - add options for controling image size, and noise scheduler.
  • [v1.0.8]: Gradio GUI - add options for controling noise level, audio-image embedding arithmetic strength, and number of inference steps.
anything2image.mp4

Getting Started

Requirements

Ensure you have PyTorch installed.

  • Python >= 3.8
  • PyTorch >= 1.13

Then install the anything2image.

# from pypi
pip install anything2image
# or locally install via git clone
git clone [email protected]:Zeqiang-Lai/Anything2Image.git
cd Anything2Image
pip install .

Usage

# lanuch gradio demo
python -m anything2image.app
# command line demo, see also the tasks examples below.
python -m anything2image.cli --audio assets/wav/cat.wav

Audio to Image

bird_audio.wav dog_audio.wav cattle.wav cat.wav
fire_engine.wav train.wav motorcycle.wav plane.wav
python -m anything2image.cli --audio assets/wav/cat.wav

See also audio2img.py.

import anything2image.imagebind as ib
import torch
from diffusers import StableUnCLIPImg2ImgPipeline

# construct models
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to(device)
model = ib.imagebind_huge(pretrained=True).eval().to(device)

# generate image
with torch.no_grad():
    audio_paths=["assets/wav/bird_audio.wav"]
    embeddings = model.forward({
        ib.ModalityType.AUDIO: ib.load_and_transform_audio_data(audio_paths, device),
    })
    embeddings = embeddings[ib.ModalityType.AUDIO]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("audio2img.png")

Audio+Text to Image

cat.wav cat.wav bird_audio.wav bird_audio.wav
A painting A photo A painting A photo
python -m anything2image.cli --audio assets/wav/cat.wav --prompt "a painting"

See also audiotext2img.py.

with torch.no_grad():
    audio_paths=["assets/wav/bird_audio.wav"]
    embeddings = model.forward({
        ib.ModalityType.AUDIO: ib.load_and_transform_audio_data(audio_paths, device),
    })
    embeddings = embeddings[ib.ModalityType.AUDIO]
    images = pipe(prompt='a painting', image_embeds=embeddings.half()).images
    images[0].save("audiotext2img.png")

Audio+Image to Image

Audio & Image Output Audio & Image Output
wave.wav wave.wav
python -m anything2image.cli --audio assets/wav/wave.wav --image "assets/image/bird.png"
with torch.no_grad():
    embeddings = model.forward({
        ib.ModalityType.VISION: ib.load_and_transform_vision_data(["assets/image/bird.png"], device),
    })
    img_embeddings = embeddings[ib.ModalityType.VISION]
    embeddings = model.forward({
        ib.ModalityType.AUDIO: ib.load_and_transform_audio_data(["assets/wav/wave.wav"], device),
    }, normalize=False)
    audio_embeddings = embeddings[ib.ModalityType.AUDIO]
    embeddings = (img_embeddings + audio_embeddings)/2
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("audioimg2img.png")

Image to Image

Top: Input Images. Bottom: Generated Images.

python -m anything2image.cli --image "assets/image/bird.png"

See also img2img.py.

with torch.no_grad():
    paths=["assets/image/room.png"]
    embeddings = model.forward({
        ib.ModalityType.VISION: ib.load_and_transform_vision_data(paths, device),
    }, normalize=False)
    embeddings = embeddings[ib.ModalityType.VISION]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("img2img.png")

Text to Image

A photo of a car. A sunset over the ocean. A bird's-eye view of a cityscape. A close-up of a flower.

It is not necessary to use ImageBind for text to image. Nervertheless, we show the alignment of ImageBind's text latent space and its image spaces.

python -m anything2image.cli --text "A sunset over the ocean."

See also text2img.py.

with torch.no_grad():
    embeddings = model.forward({
        ib.ModalityType.TEXT: ib.load_and_transform_text(['A photo of a car.'], device),
    }, normalize=False)
    embeddings = embeddings[ib.ModalityType.TEXT]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("text2img.png")

Thermal to Image

Input Output Input Output

Top: Input Images. Bottom: Generated Images.

python -m anything2image.cli --thermal "assets/thermal/030419.jpg"

See also thermal2img.py.

with torch.no_grad():
    thermal_paths =['assets/thermal/030419.jpg']
    embeddings = model.forward({
        ib.ModalityType.THERMAL: ib.load_and_transform_thermal_data(thermal_paths, device),
    }, normalize=True)
    embeddings = embeddings[ib.ModalityType.THERMAL]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("thermal2img.png")

Citation

Latent Diffusion

@InProceedings{Rombach_2022_CVPR,
    author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
    title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {10684-10695}
}

ImageBind

@inproceedings{girdhar2023imagebind,
  title={ImageBind: One Embedding Space To Bind Them All},
  author={Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang
and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan},
  booktitle={CVPR},
  year={2023}
}

anything2image's People

Contributors

zeqiang-lai avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.