Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

This document provides a list of resources on audio-visual speech enhancement and separation based on deep learning. It can be seen as an appendix to our overview paper on the topic.

Our intention is to update the list, when new material on the topic is released. The symbol '*' will be used beside a resource to indicate that it was not cited in our overview article. Feel free to propose changes or point out a resource that should be included.

If you like and use this work, please ⭐ and consider citing our overview article. This will highlight the interest of the community in our work.

@article{michelsanti2021overview,
Author = {Michelsanti, Daniel and Tan, Zheng-Hua and Zhang, Shi-Xiong and Xu, Yong and Yu, Meng and Yu, Dong and Jensen, Jesper},
Title = {An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
volume={29},
pages={1368-1396},
doi={10.1109/TASLP.2021.3066303}
Year = {2021}}

Audio-Visual Speech Corpora
Performance Assessment
Audio-Visual Speech Enhancement and Separation
Speech Reconstruction from Silent Videos
Audio-Visual Sound Source Separation for Non-Speech Signals
Related Overview Articles

Audio-Visual Speech Corpora

AVA-ActiveSpeaker [paper] [dataset page]
AV Chinese Mandarin [paper]
AVSpeech [paper] [dataset page]
ASPIRE [paper] [dataset page]
CREMA-D [paper] [dataset page] *
CUAVE [paper]
EasyCom [paper] [dataset page] *
Facestar [paper] [dataset page] *
GRID [paper] [dataset page]
KinectDigits [paper] [dataset page]
LDC2009V01 [dataset page]
Lombard GRID [paper] [dataset page]
LRS [paper]
LRS2 [paper] [dataset page]
LRS3 [paper] [dataset page]
LRW [paper] [dataset page]
Mandarin Sentences Corpus [paper]
MODALITY [paper] [dataset page]
MV-LRS [paper]
NTCD-TIMIT [paper] [dataset page]
Obama Weekly Addresses [paper]
OuluVS [paper] [dataset page]
OuluVS2 [paper] [dataset page]
RAVDESS [paper] [dataset page]
Small Mandarin Sentences Corpus [paper]
TCD-TIMIT [paper] [dataset page]
VISION [paper] *
VoxCeleb [paper] [dataset page]
VoxCeleb2 [paper] [dataset page]

Performance Assessment

Estimators of speech quality based on perceptual models

Estimators of speech quality based on energy ratios

SDR / SIR / SAR (BSS Eval) [paper] [code]
SDI [paper]
SI-SDR [paper] [code]

Estimators of speech intelligibility

CSII [paper]
ESII [paper]
ESTOI [paper] [code]
HASPI [paper] [code]
SII [paper] [code]
STOI [paper] [code]

Audio-Visual Speech Enhancement and Separation

A. Adeel, J. Ahmad, H. Larijani, and A. Hussain, “A novel real-time, lightweight chaotic-encryption scheme for next-generation audio-visual hearing aids,” Cognitive Computation, vol. 12, no. 3, pp. 589–601, 2019. [paper]
A. Adeel, M. Gogate, and A. Hussain, “Towards next-generation lip-reading driven hearing-aids: A preliminary prototype demo,” in Proc. of CHAT, 2017. [paper] [demo]
A. Adeel, M. Gogate, and A. Hussain, “Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments,” Information Fusion, vol. 59, pp. 163–170, 2020. [paper]
A. Adeel, M. Gogate, A. Hussain, and W. M. Whitmer, “Lip-reading driven deep learning approach for speech enhancement,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2019. [paper]
T. Afouras, J. S. Chung, and A. Zisserman, “The conversation: Deep audio-visual speech enhancement,” Proc. of Interspeech, 2018. [paper] [project page] [demo 1] [other demos]
T. Afouras, J. S. Chung, and A. Zisserman, “My lips are concealed: Audio-visual speech enhancement through obstructions,” in Proc. of Interspeech, 2019. [paper] [project page] [demo]
Z. Aldeneh, A. P. Kumar, B.-J. Theobald, E. Marchi, S. Kajarekar, D. Naik, and A. H. Abdelaziz, “Self-supervised learning of visual speech features with audiovisual speech enhancement,” arXiv preprint arXiv:2004.12031, 2020. [paper]
A. Arriandiaga, G. Morrone, L. Pasa, L. Badino, and C. Bartolozzi, “Audio-visual target speaker extraction on multi-talker environment using event-driven cameras,” arXiv preprint arXiv:1912.02671, 2019. [paper]
S.-Y. Chuang, Y. Tsao, C.-C. Lo, and H.-M. Wang, “Lite audio-visual speech enhancement,” in Proc. of Interspeech (to appear), 2020. [paper] [code]
H. Chen, J. Du, Y. Hu, L.-R. Dai, B.-C. Yin, C.-H. Lee, “Correlating subword articulation with lip shapes for embedding aware audio-visual speech enhancement ,” in Neural Network, vol. 143, pp. 171-182, 2021. [paper] *
S.-W. Chung, S. Choe, J. S. Chung, and H.-G. Kang, “Facefilter: Audio-visual speech separation using still images,” arXiv preprint arXiv:2005.07074, 2020. [paper] [demo]
A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” ACM Transactions on Graphics, vol. 37, no. 4, pp. 112:1–112:11, 2018. [paper] [project page] [demo] [supplementary material]
A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, “Seeing through noise: Visually driven speaker separation and enhancement,” in Proc. of ICASSP, 2018. [paper] [project page] [demo] [code]
A. Gabbay, A. Shamir, and S. Peleg, “Visual speech enhancement,” in Proc. of Interspeech, 2018. [paper] [project page] [demo 1] [other demos] [code]
R. Gao and K. Grauman, “VISUALVOICE: Audio-visual speech separation with cross-modal consistency,” in Proc. of CVPR, 2021. [paper] [project page] [demo] [code] [supplementary material] *
M. Gogate, A. Adeel, R. Marxer, J. Barker, and A. Hussain, “DNN driven speaker independent audio-visual mask estimation for speech separation,” in Proc. of Interspeech, 2018. [paper]
M. Gogate, K. Dashtipour, A. Adeel, and A. Hussain, “Cochleanet: A robust language-independent audio-visual model for speech enhancement,” Information Fusion, vol. 63, pp. 273–285, 2020. [paper] [project page] [demo] [supplementary material]
M. Gogate, K. Dashtipour, and A. Hussain, “Towards Robust Real-time Audio-Visual Speech Enhancement,” arXiv preprint arXiv:2112.09060, 2021. [paper] *
A. Golmakani, M. Sadeghi, R. Serizel, "Audio-visual speech enhancement with a deep Kalman filter generative model," arXiv preprint arXiv:2211.00988, 2021. [paper] *
R. Gu, S.-X. Zhang, Y. Xu, L. Chen, Y. Zou, and D. Yu, “Multi-modal multi-channel target speech separation,” IEEE Journal of Selected Topics in Signal Processing, 2020. [paper] [project page] [demo]
J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.- M. Wang, “Audio-visual speech enhancement using multimodal deep convolutional neural networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 117–128, 2018. [paper]
J.-C. Hou, S.-S. Wang, Y.-H. Lai, J.-C. Lin, Y. Tsao, H.-W. Chang, and H.-M. Wang, “Audio-visual speech enhancement using deep neural networks,” in Proc. of APSIPA, 2016. [paper]
A. Hussain, J. Barker, R. Marxer, A. Adeel, W. Whitmer, R. Watt, and P. Derleth, “Towards multi-modal hearing aid design and evaluation in realistic audio-visual settings: Challenges and opportunities,” in Proc. of CHAT, 2017. [paper]
T. Hussain, M. Gogate, K. Dashtipour, and A. Hussain, “Towards intelligibility-oriented audio-visual speech enhancement,” arXiv preprint arXiv:2111.09642, 2021. [paper] *
E. Ideli, “Audio-visual speech processing using deep learning techniques.” MSc thesis, Applied Sciences: School of Engineering Science, 2019. [paper]
E. Ideli, B. Sharpe, I. V. Bajić, and R. G. Vaughan,“Visually assisted time-domain speech enhancement,” in Proc. of GlobalSIP, 2019. [paper]
B. İnan, M. Cernak, H. Grabner, H. P. Tukuljac, R. C. Pena, and B. Ricaud, “Evaluating audiovisual source separation in the context of video conferencing,” Proc. of Interspeech, 2019. [paper] [code]
K. Ito, M. Yamamoto, and K. Nagamatsu, “Audio-visual speech enhancement method conditioned in the lip motion and speaker-discriminative embeddings,” Proc. of ICASSP, 2021. [paper] *
M. L. Iuzzolino and K. Koishida, “AV(SE)²: Audio-visual squeeze- excite speech enhancement,” in Proc. of ICASSP. IEEE, 2020, pp. 7539–7543. [paper]
H. R. V. Joze, A. Shaban, M. L. Iuzzolino, and K. Koishida, “MMTM: Multimodal transfer module for CNN fusion,” Proc. of CVPR, 2020. [paper]
Z. Kang, M. Sadeghi, R. Horaud, X. Alameda-Pineda, J. Donley, and A. Kumar, “The impact of removing head movements on audio-visual speech enhancement,” arXiv preprint arXiv:2202.00538, 2022. [paper] [project page] *
F. U. Khan, B. P. Milner, and T. Le Cornu, “Using visual speech information in masking methods for audio speaker separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1742–1754, 2018. [paper]
C. Li and Y. Qian, “Deep audio-visual speech separation with attention mechanism,” in Proc. of ICASSP, 2020. [paper]
Y. Li, Z. Liu, Y. Na, Z. Wang, B. Tian, and Q. Fu, “A visual-pilot deep fusion for target speech separation in multitalker noisy environment,” in Proc. of ICASSP, 2020. [paper]
R. Lu, Z. Duan, and C. Zhang, “Listen and look: Audio–visual matching assisted speech source separation,” IEEE Signal Processing Letters, vol. 25, no. 9, pp. 1315–1319, 2018. [paper]
R. Lu, Z. Duan, and C. Zhang, “Audio–visual deep clustering for speech separation, ”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 11, pp. 1697–1712, 2019. [paper]
Y. Luo, J. Wang, X. Wang, L. Wen, and L. Wang, “Audio-visual speech separation using i-Vectors,” in Proc. of ICICSP, 2019. [paper]
N. Makishima, M. Ihori, A. Takashima, T. Tanaka, S. Orihashi, and R. Masumura, “Audio-visual speech separation using cross-modal correspondence loss,” in Proc. of ICASSP, 2021. [paper] *
D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, “On training targets and objective functions for deep-learning-based audio-visual speech enhancement,” in Proc. of ICASSP, 2019. [paper] [supplementary material]
D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, “Deep- learning-based audio-visual speech enhancement in presence of Lombard effect,” Speech Communication, vol. 115, pp. 38–50, 2019. [paper] [demo]
D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, “Effects of Lombard reflex on the performance of deep-learning-based audio-visual speech enhancement systems,” in Proc. of ICASSP, 2019. [paper] [demo]
J. F. Montesinos, V. S. Kadandale, and G. Haro, “VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer,” arXiv preprint arXiv:2203.04099, 2022. [paper] [demo] [code] [project page] *
G. Morrone, S. Bergamaschi, L. Pasa, L. Fadiga, V. Tikhanoff, and L. Badino, “Face landmark-based speaker-independent audio-visual speech enhancement in multi-talker environments,” in Proc. of ICASSP, 2019. [paper] [project page] [demo] [other demos] [code]
T. Ochiai, M. Delcroix, K. Kinoshita, A. Ogawa, and T. Nakatani, “Multimodal SpeakerBeam: Single channel target speech extraction with audio-visual speaker clues,” Proc. Interspeech, 2019. [paper]
A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” in Proc. of ECCV, 2018. [paper] [project page] [demo] [code]
Z. Pan, M. Ge and H. Li, “USEV: Universal speaker extraction with visual cue,” 2021. [paper] [code] *
Z. Pan, R. Tao, C. Xu and H. Li, “MuSe: Multi-modal target speaker extraction with visual cues,” in Proc. of ICASSP, 2021. [paper] [code] *
Z. Pan, R. Tao, C. Xu and H. Li, “Selective Hearing through Lip-reading,” arXiv preprint arXiv:2106.07150, 2021. [paper] [code] *
L. Pasa, G. Morrone, and L. Badino, “An analysis of speech enhancement and recognition losses in limited resources multi-talker single channel audio-visual ASR,” in Proc. of ICASSP, 2020. [paper]
L. Qu, C. Weber, and S. Wermter, “Multimodal target speech separation with voice and face references,” arXiv preprint arXiv:2005.08335, 2020. [paper] [project page] [demo]
A. Rahimi, T. Afouras, A. Zisserman, “Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation,” Proc. of CVPR, 2022. [paper] [project page] *
M. Sadeghi and X. Alameda-Pineda, “Mixture of inference networks for VAE-based audio-visual speech enhancement,” arXiv preprint arXiv:1912.10647, 2019. [paper] [project page] [demo] [code]
M. Sadeghi and X. Alameda-Pineda, “Robust unsupervised audio-visual speech enhancement using a mixture of variational autoencoders,” in Proc. of ICASSP, 2020. [paper] [project page] [supplementary material] [code]
M. Sadeghi and X. Alameda-Pineda, “Switching variational auto-encoders for noise-agnostic audio-visual speech enhancement,” in Proc. of ICASSP, 2021. [paper] [project page] *
M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, “Audio-visual speech enhancement using conditional variational autoencoders,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1788–1800, 2020. [paper] [project page] [demo] [code]
H. Sato, T. Ochiai, K. Kinoshita, M. Delcroix, T. Nakatani and S. Araki. “Multimodal attention fusion for target speaker extraction,” in Proc. of SLT, 2021. [paper] [project page] [demo] *
S. S. Shetu, S. Chakrabarty, and E. A. P. Habets, “An empirical study of visual features for DNN based audio-visual speech enhancement in multi-talker environments,” in Proc. of ICASSP, 2021. [paper] *
Z. Sun, Y. Wang, and L. Cao, “An attention based speaker-independent audio-visual deep learning model for speech enhancement,” in Proc. of MMM, 2020. [paper]
K. Tan, Y. Xu, S.-X. Zhang, M. Yu, and D. Yu, “Audio-visual speech separation and dereverberation with a two-stage multimodal network,” IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 3, pp. 542–553, 2020. [paper] [project page] [demo]
W. Wang, C. Xing, D. Wang, X. Chen, and F. Sun, “A robust audio-visual speech enhancement model,” in Proc. of ICASSP, 2020. [paper]
J. Wu, Y. Xu, S.-X. Zhang, L.-W. Chen, M. Yu, L. Xie, and D. Yu, “Time domain audio visual speech separation,” in Proc. of ASRU, 2019. [paper] [project page] [demo]
Z. Wu, S. Sivadas, Y. K. Tan, M. Bin, and R. S. M. Goh,“Multi-modal hybrid deep neural network for speech enhancement,” arXiv preprint arXiv:1606.04750, 2016. [paper]
X. Xu, Y. Wang, D. Xu, C. Zhang, Y. Peng, J. Jia, and B. Chen, “VSEGAN: Visual speech enhancement generative adversarial network,” arXiv preprint arXiv:2102.02599, 2021. [paper] [project page] *
X. Xu, Y. Wang, D. Xu, C. Zhang, Y. Peng, J. Jia, and B. Chen, “AMFFCN: Attentional multi-layer feature fusion convolution network for audio-visual speech enhancement,” arXiv preprint arXiv:2101.06268, 2021. [paper] [project page] *
X. Xu, Y. Wang, J. Jia, B. Chen and D. Li, “Improving visual speech enhancement network by learning audio-visual affinity with multi-head attention,” arXiv preprint arXiv:2206.14964, 2022. [paper] [project page] *
Y. Xu, M. Yu, S.-X. Zhang, L. Chen, C. Weng, J. Liu, and D. Yu, “Neural spatio-temporal beamformer for target speech separation,” Proc. of Interspeech (to appear), 2020. [paper] [project page] [demo]
K. Yang, D. Markovic, S. Krenn, V. Agrawal, and A. Richard, “Audio-visual speech codecs: rethinking audio-visual speech enhancement by re-synthesis,” Proc. of CVPR (to appear), 2022. [paper] *

Speech Reconstruction From Silent Videos

H. Akbari, H. Arora, L. Cao, and N. Mesgarani, “Lip2AudSpec: Speech reconstruction from silent lip movements video,” in Proc. of ICASSP, 2018. [paper] [demo 1] [demo 2] [demo 3] [code]
A. Ephrat, T. Halperin, and S. Peleg, “Improved speech reconstruction from silent video,” in Proc. of CVAVM, 2017. [paper] [project page] [demo]
A. Ephrat and S. Peleg, “Vid2Speech: Speech reconstruction from silent video,” in Proc. of ICASSP, 2017. [paper] [project page] [demo 1] [demo 2] [demo 3] [code]
J. Hong, M. Kim, S.J. Park, Y.M. Ro, “Speech reconstruction with reminiscent sound via visual voice memory,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3654-3667, 2021. [paper] [demo] *
J. Hong, M. Kim, S.J. Park, Y.M. Ro, “VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection,” arXiv preprint arXiv:2206.07458, 2022. [paper] [demo] *
M. Kim, J. Hong, Y. M. Ro, “Lip to speech synthesis with visual context attentional GAN,” in Proc. of NeurIPS, 2021. [paper] *
Y. Kumar, M. Aggarwal, P. Nawal, S. Satoh, R. R. Shah, and R. Zimmermann, “Harnessing AI for speech reconstruction using multi-view silent video feed,” in Proc. of ACM-MM, 2018. [paper]
Y. Kumar, R. Jain, K. M. Salik, R. R. Shah, Y. Yin, and R. Zimmermann, “Lipper: Synthesizing thy speech using multi-view lipreading,” in Proc. of AAAI, 2019. [paper] [demo]
Y. Kumar, R. Jain, M. Salik, R. R. Shah, R. Zimmermann, and Y. Yin, “MyLipper: A personalized system for speech reconstruction using multi-view visual feeds,” in Proc. of ISM, 2018. [paper] [demo]
T. Le Cornu and B. Milner, “Reconstructing intelligible audio speech from visual speech features,” in Proc. of Interspeech, 2015. [paper]
T. Le Cornu and B. Milner, “Generating intelligible audio speech from visual speech,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 9, pp. 1751–1761, 2017. [paper] [demo]
D. Michelsanti, O. Slizovskaia, G. Haro, E. Go ́mez, Z.-H. Tan, and J. Jensen, “Vocoder-based speech synthesis from silent videos,” in Proc. of Interspeech (to appear), 2020. [paper] [project page] [demo]
R. Mira, A. Haliassos, S. Petridis, B. W. Schuller, and M. Pantic “SVTS: Scalable Video-to-Speech Synthesis,” arXiv preprint arXiv:2205.02058, 2022. [paper] [demo] *
R. Mira, K. Vougioukas, P. Ma, S. Petridis, B. W. Schuller, and M. Pantic, “End-to-end video-to-speech synthesis using generative adversarial networks,” arXiv preprint arXiv:2104.13332, 2021. [paper] [project page] *
K. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. Jawahar, “Learning individual speaking styles for accurate lip to speech synthesis,” in Proc. of CVPR, 2020. [paper] [project page] [demo] [code]
L. Qu, C. Weber, and S. Wermter. "LipSound: Neural Mel-Spectrogram Reconstruction for Lip Reading," in Proc. of Interspeech, 2019. [paper] [demo] *
L. Qu, C. Weber, and S. Wermter. "LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading," in IEEE Transactions on Neural Networks and Learning Systems, 2022. [paper] [demo] *
N. Saleem, J. Gao, M. Irfan, E. Verdu, and J. Parra Fuente. “E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis,” in Image and Vision Computing, Vol. 119, 2022. [paper] *
Y. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based lip-to-speech synthesis using convolutional neural networks,” in Proc. of IW-FCV, 2019. [paper]
S. Uttam, Y. Kumar, D. Sahrawat, M. Aggarwal, R. R. Shah, D. Mahata, and A. Stent, “Hush-hush speak: Speech reconstruction using silent videos,” in Proc. of Interspeech, 2019. [paper] [demo] [code]
M. Varshney, R. Yadav, V. P. Namboodiri, R. M. Hegde, “Learning Speaker-specific Lip-to-Speech Generation,” in arXiv preprint arXiv:2206.02050. [paper] [project page] *
K. Vougioukas, P. Ma, S. Petridis, and M. Pantic, “Video-driven speech reconstruction using generative adversarial networks,” in Proc. of Interspeech, 2019. [paper] [project page] [demo 1] [demo 2] [demo 3]
D. Wang, S. Yang, D. Su, X. Liu, D. Yu, and H. Meng, “VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion,” in arXiv preprint arXiv:2202.09081. [paper] [demo] *
Y. Wang, and Z. Zhao, “FastLTS: Non-autoregressive end-to-end unconstrained lip-to-speech synthesis,” in arXiv preprint arXiv:2207.03800. [paper] *
R. Yadav, A. Sardana, V. P. Namboodiri, and R. M. Hegde, “Speech prediction in silent videos using variational autoencoders,” in Proc. of ICASSP, 2021. [paper] *

Audio-Visual Sound Source Separation for Non-Speech Signals

C. Gan, D. Huang, H. Zhao, J. B. Tenenbaum, and A. Torralba, “Music gesture for visual sound separation,” in Proc. of CVPR, 2020. [paper] [project page] [demo]
R. Gao, R. Feris, and K. Grauman, “Learning to separate object sounds by watching unlabeled video,” in Proc. of ECCV, 2018. [paper] [project page] [demo 1] [demo 2] [code]
R. Gao and K. Grauman, “2.5D visual sound,” in Proc. of CVPR, 2019. [paper] [project page] [demo] [code]
R. Gao and K. Grauman, “Co-separating sounds of visual objects,” in Proc. of ICCV, 2019. [paper] [project page] [demo] [code]
S. Parekh, A. Ozerov, S. Essid, N. Q. Duong, P. Pérez, and G. Richard, “Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision,” in Proc. of WASPAA, 2019. [paper] [project page] [demo]
J. F. Montesinos, V. S. Kadandale, and G. Haro, “A cappella: Audio-visual Singing Voice Separation,” in Proc. of BMVC, 2021. [paper] [project page] [demo][code] *
A. Rouditchenko, H. Zhao, C. Gan, J. McDermott, and A. Torralba, “Self-supervised audio-visual co-segmentation,” in Proc. of ICASSP, 2019. [paper]
O. Slizovskaia, G. Haro, and E. Gómez, “Conditioned source separation for music instrument performances,” arXiv preprint arXiv:2004.03873, 2020. [paper] [project page] [demo][code]
X. Xu, B. Dai, and D. Lin, “Recursive visual sound separation using minus-plus net,” in Proc. of ICCV, 2019. [paper] [demo]
H. Zhao, C. Gan, W.-C. Ma, and A. Torralba, “The sound of motions,” in Proc. of ICCV, 2019. [paper] [demo]
H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba, “The sound of pixels,” in Proc. of ECCV, 2018. [paper] [project page] [demo 1] [demo 2][code]
L. Zhu and E. Rahtu, “Separating sounds from a single image,” arXiv preprint arXiv:2007.07984, 2020. [paper] [project page]
L. Zhu and E. Rahtu, “Visually guided sound source separation using cascaded oppo- nent filter network,” arXiv preprint arXiv:2006.03028, 2020. [paper] [project page]

Audio-Visual Speech Inpainting

G. Morrone, D. Michelsanti, Z.-H. Tan and J. Jensen, “Audio-visual speech inpainting with deep learning,” in Proc. of ICASSP, 2021. [paper] [project page] [demo]

Challenges

1st COG-MHEAR Audio-Visual Speech Enhancement Challenge (AVSE) [challenge page] [baseline code] *

snsun / av-se Goto Github PK

av-se's Introduction

Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Table of Contents

Audio-Visual Speech Corpora

Performance Assessment

Estimators of speech quality based on perceptual models

Estimators of speech quality based on energy ratios

Estimators of speech intelligibility

Audio-Visual Speech Enhancement and Separation

Speech Reconstruction From Silent Videos

Audio-Visual Sound Source Separation for Non-Speech Signals

Audio-Visual Speech Inpainting

Related Overview Articles

Challenges

av-se's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent