Implementation of Mirasol, SOTA Multimodal Autoregressive model out of Google DeepMind, in PyTorch
Will simply implement the Transformer Combiner and omit the other variants.
- StabilityAI, A16Z Open Source AI Grant Program, and 🤗 Huggingface for the generous sponsorships, as well as my other sponsors, for affording me the independence to open source current artificial intelligence research
$ pip install mirasol-pytorch
import torch
from mirasol_pytorch import Mirasol
model = Mirasol(
    dim = 512,
    num_text_tokens = 256,
    video_image_size = 128,
    video_frames_per_timechunk = 2,
    audio_freq_dim = 64,
    audio_time_dim_per_timechunk = 32,
    audio_patch_size = (32, 16),
    video_patch_size = (64, 2),
    audio_encoder = dict(
        dim = 512,
        depth = 2
    ),
    video_encoder = dict(
        dim = 512,
        depth = 2
    )
)
audio = torch.randn(1, 64, 1024)
video = torch.randn(1, 3, 12, 128, 128)
text = torch.randint(0, 256, (1, 1024))
loss = model(
    audio = audio,
    video = video,
    text = text
)
loss.backward()
# after much training
sampled_text = model.generate(
    audio = audio,
    video = video,
    seq_len = 512
)
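As a quick sanity check on the input shapes, here is a minimal sketch that works out how many time chunks each modality would yield. It assumes chunking is a plain division of the time axis by the per-chunk size, and that the axis layouts are `(batch, freq, time)` for audio and `(batch, channels, frames, height, width)` for video, as inferred from the constructor arguments. The helper `num_timechunks` is hypothetical and not part of `mirasol_pytorch`. How the library reconciles differing chunk counts between audio and video is not covered here.

```python
def num_timechunks(time_len: int, per_chunk: int) -> int:
    """Hypothetical helper: number of whole chunks along a time axis."""
    if per_chunk <= 0:
        raise ValueError("per_chunk must be positive")
    if time_len % per_chunk != 0:
        raise ValueError(f"time length {time_len} is not divisible by {per_chunk}")
    return time_len // per_chunk

# shapes taken from the usage example; axis meanings are assumptions
audio_shape = (1, 64, 1024)          # (batch, freq, time)
video_shape = (1, 3, 12, 128, 128)   # (batch, channels, frames, height, width)

# per-chunk sizes mirror audio_time_dim_per_timechunk = 32
# and video_frames_per_timechunk = 2 from the constructor call
audio_chunks = num_timechunks(audio_shape[-1], 32)
video_chunks = num_timechunks(video_shape[2], 2)

print(audio_chunks, video_chunks)
```

A time length that is not a multiple of the per-chunk size raises a `ValueError` in this sketch. That is a design choice of the example, not a statement about the library's behavior.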
- text generation code
- auto-handle start token for decoder
- positional embeddings for video and audio encoder
- enable register tokens for both video and audio encoder, in line with new research
- add audio and video reconstruction losses
- add similarity regularization from TTS research
@article{Piergiovanni2023Mirasol3BAM,
title = {Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities},
author = {A. J. Piergiovanni and Isaac Noble and Dahun Kim and Michael S. Ryoo and Victor Gomes and Anelia Angelova},
journal = {ArXiv},
year = {2023},
volume = {abs/2311.05698},
url = {https://api.semanticscholar.org/CorpusID:265129010}
}
@inproceedings{Liu2022TowardsBF,
title = {Towards Better Few-Shot and Finetuning Performance with Forgetful Causal Language Models},
author = {Hao Liu and Xinyang Geng and Lisa Lee and Igor Mordatch and Sergey Levine and Sharan Narang and P. Abbeel},
year = {2022},
url = {https://api.semanticscholar.org/CorpusID:256416540}
}
@article{Darcet2023VisionTN,
title = {Vision Transformers Need Registers},
author = {Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
journal = {ArXiv},
year = {2023},
volume = {abs/2309.16588},
url = {https://api.semanticscholar.org/CorpusID:263134283}
}
@article{Bondarenko2023QuantizableTR,
title = {Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing},
author = {Yelysei Bondarenko and Markus Nagel and Tijmen Blankevoort},
journal = {ArXiv},
year = {2023},
volume = {abs/2306.12929},
url = {https://api.semanticscholar.org/CorpusID:259224568}
}
@misc{shi2023enhance,
title = {Enhance audio generation controllability through representation similarity regularization},
author = {Yangyang Shi and Gael Le Lan and Varun Nagaraja and Zhaoheng Ni and Xinhao Mei and Ernie Chang and Forrest Iandola and Yang Liu and Vikas Chandra},
year = {2023},
eprint = {2309.08773},
archivePrefix = {arXiv},
primaryClass = {cs.SD}
}