Speech_enhancement_using_convolutional_autoencoders

This repository explores speech enhancement using Convolutional Autoencoders.

Table of Contents

Description
Methodology
How to run
References

Description

In this repository we explore how a U-Net model, i.e. a convolutional autoencoder with skip connections, can be used for speech enhancement. U-Net is a convolutional network architecture for fast and precise image segmentation that was first used on biomedical images, but we can also take advantage of it for denoising noisy speech data. Intuitively, we can treat log-spectrograms as 2D images fed to the network: relevant features are encoded and then decoded into another spectrogram with reduced noise.

Methodology

Steps involved in denoising noisy speech data:

Generating noisy speech files by adding different kinds of noise to clean speech files.
Convert noisy speech files to time series data (Waveform model).
Convert time series data to log-spectrograms.
Training U-Net model to learn noise spectrograms.
Generating clean speech spectrograms and converting them back to .wav files.
Calculate WER (Word error rate), MER (Match error rate) and WIL (Word information lost) to evaluate performance of the model.

Generating noisy speech files

We can use the script provided by MS-SNSD to generate the required noisy speech files. I have already generated a few for my project; you can access them through this drive link.
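As a rough illustration of what such a script does, the sketch below mixes a clean signal with a noise signal at a chosen signal-to-noise ratio. It is not the MS-SNSD script itself: the function name `mix_at_snr`, the looping and trimming of the noise clip, and the synthetic test signals are all assumptions made for this example.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` so that the result has the requested SNR in dB.

    Hypothetical helper: assumes both inputs are 1-D float arrays and that
    neither is pure silence.
    """
    # Loop the noise clip if it is shorter than the speech, then trim it.
    if len(noise) < len(clean):
        repeats = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)

    # Choose the gain so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


# Synthetic stand-ins for a clean speech clip and a noise clip.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
noise = rng.normal(size=3000)
noisy = mix_at_snr(clean, noise, snr_db=10)
```

Sweeping `snr_db` over several values and several noise types gives a training set that covers both mild and severe noise.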

Convert noisy speech files to time series data

To load .wav files for computation we can use librosa.load with a sampling rate of 8000 Hz. For computational purposes we limit the size of each extracted audio file to 2*8064 (~2 sec) and stack them in a 2-D array of size (number_of_audio_files X 8064).
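A minimal sketch of this step is shown below. The row length of 8064 samples is taken from the array shape above; cutting each waveform into non-overlapping rows and dropping the remainder is an assumption, as are the helper names `load_waveform` and `frame_waveform`.

```python
import numpy as np

SAMPLE_RATE = 8000   # sampling rate passed to librosa.load
ROW_LENGTH = 8064    # samples per row of the stacked array


def load_waveform(path, sample_rate=SAMPLE_RATE):
    """Load a .wav file as a mono float array resampled to `sample_rate`."""
    import librosa  # imported lazily so frame_waveform works without it

    waveform, _ = librosa.load(path, sr=sample_rate)
    return waveform


def frame_waveform(waveform, row_length=ROW_LENGTH):
    """Cut a 1-D waveform into non-overlapping rows of `row_length` samples.

    Trailing samples that do not fill a whole row are dropped (an assumption).
    """
    n_rows = len(waveform) // row_length
    return waveform[: n_rows * row_length].reshape(n_rows, row_length)


# Stand-in for a loaded file: 20000 samples give two full rows.
example = np.arange(20000, dtype=np.float32)
rows = frame_waveform(example)

# Rows from several files can be stacked into one 2-D array.
stacked = np.vstack([rows, frame_waveform(example)])
```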

Convert time series data to log-spectrograms

Time series data carries no explicit frequency information, so we use the Fourier transform to obtain it. In particular, we compute the short-time Fourier transform (STFT) of the time series, which outputs a complex matrix (spectrogram) that gives a clear picture of how the different frequency components evolve over time. Since human perception of sound is logarithmic, we convert the spectrograms into log-spectrograms, i.e. convert power into decibels. The model input only needs the magnitude part, while the phase part is required for converting spectrograms back to audio files. We generate spectrograms for both noisy and clean speech files.
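To make the magnitude/phase split concrete, here is a NumPy-only sketch of an STFT followed by conversion to decibels. In practice librosa.stft and librosa.amplitude_to_db do this work; the window size of 256, the hop of 64 and the helper names are assumptions chosen for the example, not values taken from this repository.

```python
import numpy as np


def stft(waveform, n_fft=256, hop_length=64):
    """Short-time Fourier transform with a Hann window.

    Returns a complex array of shape (n_fft // 2 + 1, n_frames).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop_length
    frames = np.stack([
        waveform[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1).T


def to_log_magnitude_and_phase(spectrogram, floor=1e-10):
    """Split a complex spectrogram into a dB magnitude and a phase (radians)."""
    magnitude = np.abs(spectrogram)
    phase = np.angle(spectrogram)
    log_magnitude = 20.0 * np.log10(np.maximum(magnitude, floor))
    return log_magnitude, phase


# A 1 kHz tone sampled at 8 kHz should peak at bin 1000 / (8000 / 256) = 32.
time = np.arange(8064) / 8000
tone = np.sin(2 * np.pi * 1000 * time)
spectrogram = stft(tone)
log_magnitude, phase = to_log_magnitude_and_phase(spectrogram)
peak_bin = int(np.argmax(log_magnitude.mean(axis=1)))
```

Only `log_magnitude` would be fed to the network; `phase` is kept aside for reconstruction.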

Training U-Net model to learn noise spectrograms

At its core, the U-Net model is a convolutional autoencoder with skip connections. Max-pooling is used in the encoding part and up-sampling in the decoding part. Skip connections are added from encoder to decoder to tackle the vanishing gradient issue; they also pass fine detail lost during pooling on to the decoder. The U-Net model takes a noisy speech spectrogram as input and is trained to output the noise spectrogram, which is equal to noisy speech spectrogram - clean speech spectrogram.
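The sketch below walks through the data flow of a single encoder/decoder level and builds the training target. It is a shape illustration in NumPy only: the convolutions are left out, random arrays stand in for real spectrograms and feature maps, and the sizes (128 x 128 spectrograms, 16 channels) are assumptions rather than the repository's actual configuration.

```python
import numpy as np


def max_pool_2x2(features):
    """Halve height and width of a (H, W, C) map by taking 2x2 maxima."""
    height, width, channels = features.shape
    blocks = features.reshape(height // 2, 2, width // 2, 2, channels)
    return blocks.max(axis=(1, 3))


def upsample_2x(features):
    """Double height and width by repeating each value (nearest neighbour)."""
    return features.repeat(2, axis=0).repeat(2, axis=1)


rng = np.random.default_rng(0)

# Training pair: the network sees the noisy spectrogram and learns the noise.
clean_db = rng.uniform(-80.0, 0.0, size=(128, 128))
noisy_db = clean_db + rng.uniform(0.0, 10.0, size=(128, 128))
target_noise_db = noisy_db - clean_db

# One U-Net level: encoder output -> pool -> (deeper layers) -> upsample.
encoder_features = rng.normal(size=(128, 128, 16))  # stand-in for a conv output
pooled = max_pool_2x2(encoder_features)             # encoder side
upsampled = upsample_2x(pooled)                     # decoder side

# Skip connection: concatenate encoder features with the upsampled map.
merged = np.concatenate([encoder_features, upsampled], axis=-1)
```

In a real model a convolution would follow each of these steps, and several such levels would be nested inside one another.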

Generating clean speech spectrograms and converting them back to .wav files

To generate clean speech spectrograms we take the noise spectrograms predicted by our model and subtract them from the noisy speech spectrograms given as input. To convert a log-spectrogram back to an audio file we first build the complex matrix by multiplying the magnitude with its respective phase spectrogram, and then use librosa.core.istft to generate time series data, which can be written to a .wav file using soundfile.write.
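The reconstruction can be sketched as follows, assuming the log-spectrograms are in decibels relative to amplitude 1 and the phase is stored in radians. The helper names, the hop length of 64 and the absence of any input scaling are assumptions for this example; the repository's own code may differ.

```python
import numpy as np


def db_to_amplitude(values_db):
    """Invert 20 * log10(amplitude)."""
    return 10.0 ** (values_db / 20.0)


def rebuild_complex_spectrogram(noisy_db, predicted_noise_db, phase):
    """Subtract the predicted noise and re-attach the noisy signal's phase."""
    clean_db = noisy_db - predicted_noise_db
    magnitude = db_to_amplitude(clean_db)
    return magnitude * np.exp(1j * phase)


def write_wav(path, complex_spectrogram, sample_rate=8000, hop_length=64):
    """Invert the STFT and save the waveform (needs librosa and soundfile)."""
    import librosa
    import soundfile

    waveform = librosa.istft(complex_spectrogram, hop_length=hop_length)
    soundfile.write(path, waveform, sample_rate)


# Sanity check on random data: with zero predicted noise the original
# complex spectrogram should come back unchanged.
rng = np.random.default_rng(0)
original = rng.normal(size=(129, 10)) + 1j * rng.normal(size=(129, 10))
noisy_db = 20.0 * np.log10(np.abs(original))
phase = np.angle(original)
rebuilt = rebuild_complex_spectrogram(noisy_db, np.zeros_like(noisy_db), phase)
```

The phase used here comes from the noisy input, since the network only predicts magnitudes.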

Input	Output
Input audio file - 1	Output audio file - 1
Input audio file - 2	Output audio file - 2

Calculate metrics to evaluate performance of the model

Finally, to evaluate the performance of our model we use the WER (word error rate), MER (match error rate) and WIL (word information lost) metrics. Thanks to my friend Simon, who has prepared a notebook for this (Link).
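For reference, the three metrics can be computed from a word-level alignment between a reference transcript and the transcript recognised from the denoised audio. The sketch below is not the linked notebook: it is a plain-Python illustration using the standard definitions, with H hits, S substitutions, D deletions and I insertions, and it assumes non-empty transcripts. The function names are hypothetical.

```python
def align_counts(reference, hypothesis):
    """Return (hits, substitutions, deletions, insertions) for two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)

    # dist[i][j] = word-level edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)

    # Walk back through the table and count each kind of operation.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            hits += 1
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def speech_metrics(reference, hypothesis):
    """Compute WER, MER and WIL (assumes both transcripts are non-empty)."""
    hits, subs, dels, ins = align_counts(reference, hypothesis)
    ref_words = hits + subs + dels
    hyp_words = hits + subs + ins
    errors = subs + dels + ins
    return {
        "wer": errors / ref_words,
        "mer": errors / (hits + errors),
        "wil": 1.0 - (hits * hits) / (ref_words * hyp_words),
    }


counts = align_counts("the cat sat on the mat", "the cat sit on mat")
metrics = speech_metrics("the cat sat on the mat", "the cat sit on mat")
```

Lower is better for all three; comparing the scores of the noisy input against those of the denoised output shows how much the model helps recognition.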

How to run

There are 2 options to denoise data:

Retrain the whole network on some new data.
Use pre-trained model from Models folder and get the denoised output.

References

Research papers for speech-enhancement Link
Speech Enhancement In Multiple-Noise Conditions using Deep Neural Networks by Anurag Kumar, Dinei Florencio. Link
A Fully Convolutional Neural Network for Speech Enhancement by Se Rim Park and Jin Won Lee. Link
Speech Denoising DNN Link
Sound Of AI YouTube channel Link
Digital signal processing Link
Speech Enhancement Link
U-Net Model Link
Why Skip Connections are needed in U-Net Link
Understanding Semantic Segmentation with U-Net Link
Understanding Up Sampling Link
Understanding Convolutional Neural Networks Link
Padding in CNN Link
Denoising-images Link

kaushalnaresh / speech_enhancement_using_convolutional_autoencoders