This is a curated list of audio-visual learning methods and datasets, based on our survey: *Learning in Audio-visual Context: A Review, Analysis, and New Perspective*. This list will continue to be updated; please feel free to nominate good related works via Pull Requests!
[Website of Our Survey], [arXiv]
- Overview
- Table of contents
- Audio-visual Boosting
- Cross-modal Perception
- Audio-visual Collaboration
- Datasets
- Audio-visual Speech Recognition Using Deep Learning, 2015
- Temporal Multimodal Learning in Audiovisual Speech Recognition, 2016
- End-To-End Audiovisual Fusion With LSTMs, 2017
- Deep Audio-visual Speech Recognition, 2018
- Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection, 2019
- Multimodal Sparse Transformer Network for Audio-visual Speech Recognition, 2022
- Robust Self-Supervised Audio-Visual Speech Recognition, 2022
- Audio-visual Speaker Diarization Using Fisher Linear Semi-discriminant Analysis, 2016
- Audio-visual Person Recognition in Multimedia Data From the IARPA Janus Program, 2018
- Noise-tolerant Audio-visual Online Person Verification Using an Attention-based Neural Network Fusion, 2019
- Who Said That?: Audio-visual Speaker Diarisation Of Real-World Meetings, 2019
- Self-Supervised Learning for Audio-visual Speaker Diarization, 2020
- A Multi-View Approach to Audio-visual Speaker Verification, 2021
- Audio-visual Deep Neural Network for Robust Person Verification, 2021
- Exploring Multimodal Video Representation For Action Recognition, 2016
- The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary, 2018
- EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition, 2019
- SCSampler: Sampling Salient Clips From Video for Efficient Action Recognition, 2019
- Listen to Look: Action Recognition by Previewing Audio, 2020
- Audiovisual SlowFast Networks for Video Recognition, 2020
- AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition, 2021
- Cross-Domain First Person Audio-Visual Action Recognition through Relative Norm Alignment, 2021
- Domain Generalization Through Audio-Visual Relative Norm Alignment in First Person Action Recognition, 2022
- Audio-Adaptive Activity Recognition Across Video Domains, 2022
- MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition, 2022
- Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos, 2022
- Tensor Fusion Network for Multimodal Sentiment Analysis, 2017
- Multi-attention Recurrent Network for Human Communication Comprehension, 2018
- Memory Fusion Network for Multi-view Sequential Learning, 2018
- Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos, 2018
- Contextual Inter-modal Attention for Multi-modal Sentiment Analysis, 2018
- Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model, 2019
- Sentiment and Emotion help Sarcasm? A Multi-task Learning Framework for Multi-Modal Sarcasm, Sentiment and Emotion Analysis, 2020
- A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis, 2020
- Multilogue-Net: A Context Aware RNN for Multi-modal Emotion Detection and Sentiment Analysis in Conversation, 2020
- Progressive Modality Reinforcement for Human Multimodal Emotion Recognition From Unaligned Multimodal Sequences, 2021
- Multi-modal Sarcasm Detection and Humor Classification in Code-mixed Conversations, 2021
- Visual Speech Enhancement, 2018
- The Conversation: Deep Audio-Visual Speech Enhancement, 2018
- Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks, 2018
- Seeing Through Noise: Visually Driven Speaker Separation And Enhancement, 2018
- Visually Assisted Time-Domain Speech Enhancement, 2019
- On Training Targets and Objective Functions for Deep-learning-based Audio-visual Speech Enhancement, 2019
- Multimodal SpeakerBeam: Single Channel Target Speech Extraction with Audio-Visual Speaker Clues, 2019
- My Lips Are Concealed: Audio-Visual Speech Enhancement Through Obstructions, 2019
- Facefilter: Audio-Visual Speech Separation Using Still Images, 2020
- Robust Unsupervised Audio-Visual Speech Enhancement Using a Mixture of Variational Autoencoders, 2020
- Looking Into Your Speech: Learning Cross-Modal Affinity for Audio-Visual Speech Separation, 2021
- The Impact of Removing Head Movements on Audio-Visual Speech Enhancement, 2022
- Learning to Separate Object Sounds by Watching Unlabeled Video, 2018
- The Sound of Pixels, 2018
- Self-supervised Audio-visual Co-segmentation, 2019
- The Sound of Motions, 2019
- Recursive Visual Sound Separation Using Minus-Plus Net, 2019
- Co-Separating Sounds of Visual Objects, 2019
- Visually Guided Sound Source Separation using Cascaded Opponent Filter Network, 2020
- Music Gesture for Visual Sound Separation, 2020
- Visual Scene Graphs for Audio Source Separation, 2021
- Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation, 2021
- Learning to Have an Ear for Face Super-Resolution, 2020
- Appearance Matters, So Does Audio: Revealing the Hidden Face via Cross-Modality Transfer, 2021
- Vid2speech: Speech Reconstruction From Silent Video, 2017
- Improved Speech Reconstruction From Silent Video, 2017
- Lip2Audspec: Speech Reconstruction from Silent Lip Movements Video, 2018
- Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed, 2018
- Video-Driven Speech Reconstruction using Generative Adversarial Networks, 2019
- Hush-Hush Speak: Speech Reconstruction Using Silent Videos, 2019
- End-to-End Video-to-Speech Synthesis Using Generative Adversarial Networks, 2022
- Real-Time Piano Music Transcription Based on Computer Vision, 2015
- Deep Cross-Modal Audio-Visual Generation, 2017
- Audeo: Audio Generation for a Silent Performance Video, 2020
- Foley Music: Learning to Generate Music from Videos, 2020
- Sight to Sound: An End-to-End Approach for Visual Piano Transcription, 2020
- Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements, 2020
- Collaborative Learning to Generate Audio-Video Jointly, 2021
- Video Background Music Generation with Controllable Music Transformer, 2021
- Visually Indicated Sounds, 2016
- Visual to Sound: Generating Natural Sound for Videos in the Wild, 2018
- Generating Visually Aligned Sound From Videos, 2020
- Taming Visually Guided Sound Generation, 2021
- Scene-aware audio for 360° videos, 2018
- Self-Supervised Generation of Spatial Audio for 360° Video, 2018
- 2.5D Visual Sound, 2019
- Self-Supervised Audio Spatialization with Correspondence Classifier, 2019
- Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation, 2020
- Visually Informed Binaural Audio Generation without Binaural Audios, 2021
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation, 2021
- Beyond Mono to Binaural: Generating Binaural Audio From Mono Audio With Depth and Cross Modal Attention, 2022
- Synthesizing Obama: learning lip sync from audio, 2017
- Lip Movements Generation at a Glance, 2018
- You Said That?: Synthesising Talking Faces from Audio, 2019
- Few-Shot Adversarial Learning of Realistic Neural Talking Head Models, 2019
- Realistic Speech-Driven Facial Animation with GANs, 2020
- GANimation: One-Shot Anatomically Consistent Facial Animation, 2020
- MakeItTalk: Speaker-Aware Talking-Head Animation, 2020
- FReeNet: Multi-Identity Face Reenactment, 2020
- Neural Voice Puppetry: Audio-driven Facial Reenactment, 2020
- Rotate-and-Render: Unsupervised Photorealistic Face Rotation from Single-View Images, 2020
- MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation, 2020
- Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation, 2021
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation, 2021
- Audio-Driven Emotional Video Portraits, 2021
- Evaluation of Speech-to-Gesture Generation Using Bi-Directional LSTM Network, 2018
- Analyzing Input and Output Representations for Speech-Driven Gesture Generation, 2019
- Learning Individual Styles of Conversational Gesture, 2019
- To React or not to React: End-to-End Visual Pose Forecasting for Personalized Avatar during Dyadic Conversations, 2019
- Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows, 2020
- Gesticulator: A Framework For Semantically-Aware Speech-Driven Gesture Generation, 2020
- Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach, 2020
- Speech Gesture Generation From The Trimodal Context Of Text, Audio, And Speaker Identity, 2020
- SEEG: Semantic Energized Co-Speech Gesture Generation, 2022
- Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis, 2018
- Audio to Body Dynamics, 2018
- Dancing to Music, 2019
- Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning, 2020
- AI Choreographer: Music Conditioned 3D Dance Generation With AIST++, 2021
- Genre-Conditioned Long-Term 3D Dance Generation Driven by Music, 2022
- BatVision: Learning to See 3D Spatial Layout with Two Ears, 2020
- VISUALECHOES: Spatial Image Representation Learning Through Echolocation, 2020
- Beyond Image to Depth: Improving Depth Prediction Using Echoes, 2021
- Co-Attention-Guided Bilinear Model for Echo-Based Depth Estimation, 2022
- SoundNet: Learning Sound Representations from Unlabeled Video, 2016
- Self-Supervised Moving Vehicle Tracking With Stereo Sound, 2019
- There Is More Than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking With Sound by Distilling Multimodal Knowledge, 2021
- Enhanced Audio Tagging via Multi- to Single-Modal Teacher-Student Mutual Learning, 2021
- Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification, 2021
- Multimodal Knowledge Expansion, 2021
- Distilling Audio-visual Knowledge by Compositional Contrastive Learning, 2021
- Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint, 2017
- Image2song: Song Retrieval via Bridging Image Content and Lyric Words, 2017
- Cross-modal Embeddings for Video and Audio Retrieval, 2018
- Seeing voices and hearing faces: Cross-modal biometric matching, 2018
- Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA, 2019
- Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-Modal Retrieval, 2020
- Deep Cross-Modal Image–Voice Retrieval in Remote Sensing, 2020
- Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast, 2022
- Look, Listen and Learn, 2017
- Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization, 2018
- Learning Representations from Audio-Visual Spatial Alignment, 2020
- Audio-Visual Instance Discrimination with Cross-Modal Agreement, 2021
- Robust Audio-Visual Instance Discrimination, 2021
- Unsupervised Sound Localization via Iterative Contrastive Learning, 2021
- Self-Supervised Learning by Cross-Modal Audio-Video Clustering, 2020
- Labelling Unlabelled Videos From Scratch With Multi-Modal Self-Supervision, 2020
- Multimodal Clustering Networks for Self-Supervised Learning From Unlabeled Videos, 2021
- OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation, 2021
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, 2021
- Audioclip: Extending Clip to Image, Text and Audio, 2022
- MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound, 2022
- Objects that Sound, 2018
- Audio-Visual Scene Analysis with Self-Supervised Multisensory Features, 2018
- The Sound of Pixels, 2018
- Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications, 2019
- Self-supervised Audio-visual Co-segmentation, 2019
- The Sound of Motions, 2019
- Deep Multimodal Clustering for Unsupervised Audiovisual Learning, 2019
- Localizing Visual Sounds the Hard Way, 2021
- Class-aware Sounding Objects Localization via Audiovisual Correspondence, 2021
- Mix and Localize: Localizing Sound Sources in Mixtures, 2022
- Audio-Visual Segmentation, 2022
- DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction, 2019
- STAViS: Spatio-Temporal AudioVisual Saliency Network, 2020
- A Multimodal Saliency Model for Videos With High Audio-visual Correspondence, 2020
- ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction, 2021
- From Semantic Categories to Fixations: A Novel Weakly-Supervised Visual-Auditory Saliency Detection Approach, 2021
- SoundSpaces: Audio-Visual Navigation in 3D Environments, 2020
- Look, Listen, and Act: Towards Audio-Visual Embodied Navigation, 2020
- Learning to Set Waypoints for Audio-Visual Navigation, 2020
- Semantic Audio-Visual Navigation, 2021
- Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds, 2021
- Move2Hear: Active Audio-Visual Source Separation, 2021
- Sound Adversarial Audio-Visual Navigation, 2022
- Audio-visual Event Localization in Unconstrained Videos, 2018
- Dual-modality Seq2Seq Network for Audio-visual Event Localization, 2019
- Dual Attention Matching for Audio-Visual Event Localization, 2019
- Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization, 2020
- Audiovisual Transformer with Instance Attention for Audio-Visual Event Localization, 2020
- Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention, 2021
- Positive Sample Propagation along the Audio-Visual Event Line, 2021
- Cross-Modal Background Suppression for Audio-Visual Event Localization, 2022
- Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing, 2020
- Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing, 2021
- Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing, 2021
- Investigating Modality Bias in Audio Visual Video Parsing, 2022
- Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding, 2022
- Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos, 2021
- Learning To Answer Questions in Dynamic Audio-Visual Scenarios, 2022
- Audio Visual Scene-Aware Dialog, 2019
- Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog, 2019
- End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features, 2019
- A Simple Baseline for Audio-Visual Scene-Aware Dialog, 2019
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers, 2021
- Audio-Visual Scene-Aware Dialog and Reasoning Using Audio-Visual Transformers with Joint Student-Teacher Learning, 2022
Dataset | Year | # Videos | Length | Data form | Video source | Task |
---|---|---|---|---|---|---|
LRW, LRS2 and LRS3 | 2016, 2018, 2018 | - | 800h+ | video | in the wild | Speech-related, speaker-related, face generation-related tasks |
VoxCeleb, VoxCeleb2 | 2017, 2018 | - | 2,000h+ | video | YouTube | Speech-related, speaker-related, face generation-related tasks |
AVA-ActiveSpeaker | 2019 | - | 38.5h | video | YouTube | Speech-related task, speaker-related task |
Kinetics-400 | 2017 | 306,245 | 850h+ | video | YouTube | Action recognition |
EPIC-KITCHENS | 2018 | 39,594 | 55h | video | Recorded videos | Action recognition |
CMU-MOSI | 2016 | 2,199 | 2h+ | video | YouTube | Emotion recognition |
CMU-MOSEI | 2018 | 23,453 | 65h+ | video | YouTube | Emotion recognition |
VGGSound | 2020 | 200k+ | 550h+ | video | YouTube | Action recognition, sound localization |
AudioSet | 2017 | 2M+ | 5,800h+ | video | YouTube | Action recognition, sound separation |
Greatest Hits | 2016 | 977 | 9h+ | video | Recorded videos | Sound generation |
MUSIC | 2018 | 714 | 23h+ | video | YouTube | Sound separation, sound localization |
FAIR-Play | 2019 | 1,871 | 5.2h | video with binaural sound | Recorded videos | Spatial sound generation |
YT-ALL | 2018 | 1,146 | 113.1h | 360° video | YouTube | Spatial sound generation |
Replica | 2019 | - | - | 3D environment | 3D simulator | Depth estimation |
AIST++ | 2021 | - | 5.2h | 3D video | Recorded videos | Dance generation |
TED | 2019 | - | 52h | video | TED talks | Gesture generation |
SumMe | 2014 | 25 | 1h+ | video with eye-tracking | User videos | Saliency detection |
AVE | 2018 | 4,143 | 11h+ | video | YouTube | Event localization |
LLP | 2020 | 11,849 | 32.9h | video | YouTube | Event parsing |
SoundSpaces | 2020 | - | - | 3D environment | 3D simulator | Audio-visual navigation |
AVSD | 2019 | 11,816 | 98h+ | video with dialog | Crowd-sourced | Audio-visual dialog |
Pano-AVQA | 2021 | 5.4k | 7.7h | 360° video with QA | Video-sharing platforms | Audio-visual question answering |
MUSIC-AVQA | 2022 | 9,288 | 150h+ | video with QA | YouTube | Audio-visual question answering |
AVSBench | 2022 | 5,356 | 14.8h+ | video | YouTube | Audio-visual segmentation, sound localization |