
Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments

Implementation of the audio-visual speech enhancement system described in the paper Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments, by the University of Modena and Reggio Emilia and the Istituto Italiano di Tecnologia.

If you are interested in this work, check out the project page.

Getting Started

Install requirements

All code is written for Python 3. Create a virtual environment (optional) and install all the requirements by running:

pip install -r requirements.txt

Usage

The main program is av_speech_enhancement.py. You can get a list of subcommands by typing av_speech_enhancement.py -h. Try av_speech_enhancement.py <subcommand> -h for more information about a specific subcommand. The audio-visual dataset must have the following directory structure:

s1
  /audio
	/file1.wav
	/file2.wav
	...
  /video
	/file1.mpg
	/file2.mpg
	...
s2
  /audio
	/file1.wav
	/file2.wav
	...
  /video
	/file1.mpg
	/file2.mpg
	...
...

Mixed-speech generation

Generate mixed-speech for training, validation and test sets separately:

av_speech_enhancement.py mixed_speech_generator
	--data_dir <data_dir>
	--base_speaker_ids <spk1> <spk2> <...>
	[--noisy_speaker_ids <spk1> <spk2> <...>]
	--audio_dir <audio_dir>
	--dest_dir <dest_dir>
	--num_samples <num_samples>
	--num_mix <num_mix>
	--num_mix_speakers <num_mix_speakers> {1,2}

The generated files are organized as follows:

TRAINING_SET
	    /s1
	       /file1_with_s2_file2.wav
	       /file2_with_s10_file4.wav
	       ...
	    /s2
	       /file1_with_s12_file5.wav
	       /file2_with_s1_file1.wav
	       ...
	...
VALIDATION_SET
	...
TEST_SET
	...
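
Under the hood, generating a mixed-speech sample essentially amounts to summing two waveforms. Below is a minimal sketch of that operation in Python, assuming single-channel WAV files at the same sample rate and using the soundfile library; the paths, padding and peak normalization are illustrative, not the exact logic of mixed_speech_generator:

import numpy as np
import soundfile as sf

# Read a target utterance and an interfering utterance (illustrative paths).
base, sr = sf.read("s1/audio/file1.wav")
noise, noise_sr = sf.read("s2/audio/file2.wav")
assert sr == noise_sr

# Zero-pad the shorter signal so both have the same length, then sum them.
length = max(len(base), len(noise))
base = np.pad(base, (0, length - len(base)))
noise = np.pad(noise, (0, length - len(noise)))
mix = base + noise

# Peak-normalize to avoid clipping and save with the naming scheme above.
mix = mix / max(1.0, np.abs(mix).max())
sf.write("TRAINING_SET/s1/file1_with_s2_file2.wav", mix, sr)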

Audio pre-processing

Compute power-law compressed spectrograms of mixed-speech audio samples. Repeat this operation for training, validation and test sets. Files are saved in NPY format.

av_speech_enhancement.py audio_preprocessing
	--data_dir <data_dir>
	--speaker_ids <spk1> <spk2> <...>
	--audio_dir <audio_dir>
	--dest_dir <dest_dir>
	--sample_rate <sample_rate>
	--max_wav_length <max_wav_length>
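
For reference, a power-law compressed spectrogram is the magnitude STFT raised to a small exponent. A minimal sketch with librosa, assuming a 16 kHz sample rate, a 512-point FFT and a compression exponent of 0.3 (all values are illustrative, not necessarily the script's settings):

import numpy as np
import librosa

# Load the mixed-speech sample and compute its complex STFT.
wav, sr = librosa.load("TRAINING_SET/s1/file1_with_s2_file2.wav", sr=16000)
stft = librosa.stft(wav, n_fft=512, hop_length=160)

# Power-law compression of the magnitude spectrogram, saved next to the WAV file.
spec = np.abs(stft) ** 0.3
np.save("TRAINING_SET/s1/file1_with_s2_file2.npy", spec)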

Video pre-processing

Extract face landmarks from video using the Dlib face detector and face landmark predictor. Files are saved in TXT format (each row has 136 values that represent the flattened x-y coordinates of the 68 face landmarks).

av_speech_enhancement.py video_preprocessing
	--data_dir <data_dir>
	--speaker_ids <spk1> <spk2> <...>
	--video_dir <video_dir>
	--dest_dir <dest_dir>
	--shape_predictor <shape_predictor_file>
	--ext <video_file_extension>

<shape_predictor_file> contains the parameters of the face landmark extractor model. You can download a pre-trained model file here.
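
As a rough illustration of what the pre-processing does per frame, here is a sketch using dlib and OpenCV; the model file name and output path are only placeholders:

import cv2
import dlib

# Dlib face detector and 68-point landmark predictor (placeholder file name).
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

cap = cv2.VideoCapture("s1/video/file1.mpg")
rows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if faces:
        shape = predictor(gray, faces[0])
        # 68 landmarks -> 136 flattened x-y values per frame.
        rows.append(" ".join(str(v) for p in shape.parts() for v in (p.x, p.y)))
cap.release()

with open("s1/landmarks/file1.txt", "w") as f:
    f.write("\n".join(rows))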

If you want to check the result of the face landmark extractor, type:

av_speech_enhancement.py show_face_landmarks
	--video <video_file>
	--fps <fps>
	--shape_predictor <shape_predictor_file>

Computing Target Binary Masks

Compute Target Binary Masks (TBMs) from clean audio samples. For each speaker, the Long-Term Average Speech Spectrum (LTASS) is computed and then used as a threshold applied to all of that speaker's clean audio samples in <audio_dir>.

av_speech_enhancement.py tbm_computation
	--data_dir <data_dir>
	--speaker_ids <spk1> <spk2> <...>
	--audio_dir <audio_dir>
	--dest_dir <dest_dir>
	--sample_rate <sample_rate>
	--max_wav_length <max_wav_length>
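
Conceptually, the TBM marks the time-frequency bins of a clean-speech spectrogram whose energy exceeds a speaker-dependent threshold derived from the LTASS. A minimal sketch, assuming pre-computed magnitude spectrograms with shape (frequency bins, frames); the paths and the exact thresholding rule used by the script are illustrative:

import numpy as np

def compute_ltass(spectrograms):
    # Average magnitude over time across all of one speaker's utterances.
    return np.mean(np.concatenate(spectrograms, axis=1), axis=1, keepdims=True)

def target_binary_mask(spec, ltass, threshold_db=0.0):
    # Mark bins whose magnitude exceeds the LTASS by threshold_db.
    return (spec > ltass * 10.0 ** (threshold_db / 20.0)).astype(np.float32)

specs = [np.load(p) for p in ["s1/spec/file1.npy", "s1/spec/file2.npy"]]  # illustrative paths
ltass = compute_ltass(specs)
masks = [target_binary_mask(s, ltass) for s in specs]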

TFRecords generation

Before training you have to generate TFRecords of the mixed-speech dataset. <data_dir>/<mix_dir> must have three subdirectories named TRAINING_SET, VALIDATION_SET and TEST_SET, created with the mixed_speech_generator subcommand. Pre-computed spectrograms (NPY format) must be located in the same directory as the corresponding audio files. Set <tfrecords_mode> to "fixed" if all samples of the dataset have the same length (as in the GRID corpus), otherwise use "var" (as in the TCD-TIMIT corpus).

av_speech_enhancement.py tfrecords_generator
	--data_dir <data_dir>
	--num_speakers <number_speakers_mixed> {2,3}
	--mode <tfrecords_mode> {fixed,var}
	--dest_dir <dest_dir>
	--base_audio_dir <base_audio_dir>
	--video_dir <video_dir>
	--tbm_dir <tbm_dir>
	--mix_audio_dir <mix_audio_dir>
	--delta <delta_video_feat> {0,1,2}
	--norm_data_dir <normalization_data_dir>
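
As an illustration of what a single TFRecord sample could contain, the sketch below serializes the mixed-speech spectrogram, the face landmarks and the TBM of one sample as flattened float features; the feature keys, paths and encoding used by the repository may differ:

import numpy as np
import tensorflow as tf

def float_feature(array):
    # Store a flattened float array as a tf.train.Feature.
    return tf.train.Feature(float_list=tf.train.FloatList(value=array.reshape(-1)))

# Illustrative input paths.
mix_spec = np.load("TRAINING_SET/s1/file1_with_s2_file2.npy")
landmarks = np.loadtxt("s1/landmarks/file1.txt")
tbm = np.load("s1/tbm/file1.npy")

example = tf.train.Example(features=tf.train.Features(feature={
    "mix_spectrogram": float_feature(mix_spec),
    "video_landmarks": float_feature(landmarks),
    "tbm": float_feature(tbm),
}))

with tf.io.TFRecordWriter("TRAINING_SET/s1/file1_with_s2_file2.tfrecord") as writer:
    writer.write(example.SerializeToString())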

Training

Train one of the audio-visual speech enhancement models described in the paper. You can choose between the VL2M, VL2M_ref, Audio-Visual Concat and Audio-Visual Concat-ref models.

av_speech_enhancement.py training
	--data_dir <data_dir>
	--train_set <training_set_subdir>
	--validation_set <validation_set_subdir>
	--exp <experiment_id>
	--mode <tfrecords_mode> {fixed,var}
	--audio_dim <audio_frame_dimension>
	--video_dim <video_frame_dimension>
	--num_audio_samples <num_audio_samples>
	--model <model_selection> {vl2m,vl2m_ref,av_concat_mask,av_concat_mask_ref}
	--opt <optimizer_choice> {sgd,adam,momentum}
	--learning_rate <learning_rate>
	--updating_step <updating_step>
	--learning_decay <learning_decay>
	--batch_size <batch_size>
	--epochs <num_epochs>
	--hidden_units <num_hidden_lstm_units>
	--layers <num_lstm_layers>
	--dropout <dropout_rate>
	--regularization <regularization_weight>
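
For orientation only, here is a sketch of a stacked BLSTM mask estimator in the spirit of the concat models: audio and video frames are concatenated and mapped to a time-frequency mask. It is written with Keras and illustrative dimensions, not the repository's actual TensorFlow model code:

import tensorflow as tf

# Illustrative dimensions: 257 STFT bins, 136 landmark values per frame.
audio_dim, video_dim = 257, 136
hidden_units, num_layers = 250, 3

inputs = tf.keras.Input(shape=(None, audio_dim + video_dim))
x = inputs
for _ in range(num_layers):
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(hidden_units, return_sequences=True))(x)
mask = tf.keras.layers.Dense(audio_dim, activation="sigmoid")(x)

model = tf.keras.Model(inputs, mask)
model.compile(optimizer="adam", loss="mse")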

Testing

Test your trained model. Enhanced speech samples and estimated masks are saved in <data_dir>/<output_dir>. Estimated masks are saved in the <mask_dir> subdirectory of each speaker directory.

av_speech_enhancement.py testing
	--data_dir <data_dir>
	--test_set <test_set_subdir>
	--exp <experiment_id>
	--ckp <model_checkpoint>
	--mode <tfrecords_mode> {fixed,var}
	--audio_dim <audio_frame_dimension>
	--video_dim <video_frame_dimension>
	--num_audio_samples <num_audio_samples>
	--output_dir <output_dir>
	--mask_dir <mask_dir>
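
For reference, enhancement at test time amounts to applying the estimated mask to the mixture spectrogram and reconstructing a waveform with the mixture phase. A minimal sketch with librosa, assuming the mask has the same shape as the STFT and that the STFT parameters match those used during audio pre-processing (paths and values are illustrative):

import numpy as np
import librosa
import soundfile as sf

# Recompute the mixture STFT with the same parameters used for the spectrograms.
wav, sr = librosa.load("TRAINING_SET/s1/file1_with_s2_file2.wav", sr=16000)
stft = librosa.stft(wav, n_fft=512, hop_length=160)

# Apply the estimated mask and invert using the mixture phase.
mask = np.load("enhanced/s1/masks/file1_with_s2_file2.npy")  # illustrative path
enhanced = librosa.istft(stft * mask, hop_length=160)
sf.write("enhanced/s1/file1_with_s2_file2.wav", enhanced, sr)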

Reference

If this project is useful for your research, please cite:

@inproceedings{morrone2019face,
  title={Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments},
  author={Morrone, Giovanni and Bergamaschi, Sonia and Pasa, Luca and Fadiga, Luciano and Tikhanoff, Vadim and Badino, Leonardo},
  booktitle={2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={6900-6904},
  year={2019},
  organization={IEEE}
}
