Speech_enhancement_using_convolutional_autoencoders

This repository explores speech enhancement using Convolutional Autoencoders.

Table of Contents

  1. Description
  2. Methodology
  3. How to run
  4. References

Description

In this repository we explore how a U-Net model, i.e. a convolutional autoencoder with skip connections, can be used for speech enhancement. U-Net is a convolutional network architecture for fast and precise image segmentation that was first used on biomedical images, but we can also take advantage of it to denoise noisy speech data. Intuitively, we can treat log-spectrograms as 2D images fed to this network, which encodes the relevant features and then decodes them into another spectrogram with reduced noise.

Methodology

Steps involved in denoising noisy speech data:

  1. Generate noisy speech files by adding different kinds of noise to clean speech files.
  2. Convert the noisy speech files to time series data (waveforms).
  3. Convert the time series data to log-spectrograms.
  4. Train the U-Net model to learn noise spectrograms.
  5. Generate clean speech spectrograms and convert them back to .wav files.
  6. Calculate WER (word error rate), MER (match error rate) and WIL (word information lost) to evaluate the performance of the model.

Generating noisy speech files

We can use the script provided by MS-SNSD to generate the required noisy speech files. I have already generated a few for my project; you can access them through this drive link.
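For readers who want to see what mixing noise into clean speech involves, here is a minimal sketch that scales a noise clip to reach a target signal-to-noise ratio. It is not the MS-SNSD script; the function name `mix_at_snr` and the choice to tile short noise clips are assumptions made for illustration.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the requested SNR (dB).

    Hypothetical helper for illustration; MS-SNSD applies its own normalization.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat the noise if it is shorter than the speech, then crop to match.
    if len(noise) < len(clean):
        repeats = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)

    # Choose the gain so that clean_power / (gain^2 * noise_power) = 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```

Lower `snr_db` values give noisier mixtures; generating the same clean clip at several SNR levels is one way to obtain varied training data.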

Convert noisy speech files to time series data

To load .wav files for computation we can use librosa.load with a sampling rate of 8000 Hz. For computational purposes we limit the size of each extracted audio file to 2*8064 samples (~2 sec) and stack them in a 2-D array of size (number_of_audio_files x 8064).
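The loading and stacking step could look roughly like the sketch below. `librosa.load` with `sr=8000` comes from the text above; the helper names, the decision to drop the incomplete tail of each file, and the way clips are cut into rows of 8064 samples are assumptions, since the exact slicing used in this repository is not spelled out here.

```python
import numpy as np


def frame_waveform(waveform, frame_length=8064):
    """Cut a 1-D waveform into rows of `frame_length` samples, dropping the remainder."""
    waveform = np.asarray(waveform, dtype=np.float32)
    n_frames = len(waveform) // frame_length
    return waveform[: n_frames * frame_length].reshape(n_frames, frame_length)


def load_and_frame(paths, sample_rate=8000, frame_length=8064):
    """Load each .wav file at `sample_rate` and stack its frames into one 2-D array."""
    import librosa  # imported lazily so frame_waveform works without librosa installed

    frames = []
    for path in paths:
        # librosa.load resamples to `sample_rate` and returns (samples, rate).
        waveform, _ = librosa.load(path, sr=sample_rate)
        frames.append(frame_waveform(waveform, frame_length))
    return np.vstack(frames)
```

Each row of the returned array is one fixed-length excerpt, which keeps every spectrogram computed later the same shape.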

Convert time series data to log-spectrograms

Time series data carries no explicit frequency information, so we use the Fourier transform to obtain it. In particular, we compute the short-time Fourier transform (STFT) of the time series, which outputs a complex matrix (the spectrogram) that gives a clear picture of how the different frequency components evolve over time. Since human perception of sound is logarithmic, we convert the spectrograms into log-spectrograms, i.e. convert power into decibels. The model input only needs the magnitude part, while the phase part is required to convert spectrograms back to audio files. We generate spectrograms for both the noisy and the clean speech files.
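A minimal sketch of this step, assuming `librosa.stft` produces the complex matrix. The `n_fft` and `hop_length` values are placeholders rather than the settings used in this repository, and both function names are hypothetical. The decibel conversion is written out with NumPy so that it stays exactly invertible.

```python
import numpy as np


def split_log_magnitude_and_phase(stft_matrix, floor=1e-10):
    """Split a complex STFT matrix into a log-magnitude (dB) part and a phase part."""
    magnitude = np.abs(stft_matrix)
    phase = np.angle(stft_matrix)
    # 20 * log10(|S|) equals 10 * log10(|S|^2), i.e. power expressed in decibels.
    # The floor avoids log(0) in silent bins.
    log_magnitude = 20.0 * np.log10(np.maximum(magnitude, floor))
    return log_magnitude, phase


def log_spectrogram(waveform, n_fft=256, hop_length=64):
    """Compute the STFT of a waveform and return (log-magnitude, phase)."""
    import librosa  # imported lazily; n_fft and hop_length are illustrative values

    stft_matrix = librosa.stft(waveform, n_fft=n_fft, hop_length=hop_length)
    return split_log_magnitude_and_phase(stft_matrix)
```

The log-magnitude array is what the network sees; the phase array is set aside for reconstruction.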

Training U-Net model to learn noise spectrograms

At its core, the U-Net model is a convolutional autoencoder with skip connections. Max-pooling is used in the encoding part and up-sampling in the decoding part. Skip connections are added from the encoder to the decoder to tackle the vanishing gradient issue. The U-Net takes the noisy speech spectrogram as input and is trained to output the noise spectrogram, which equals the noisy speech spectrogram minus the clean speech spectrogram.
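To make the architecture and the training target concrete, here is a deliberately small sketch. It assumes Keras (via TensorFlow) as the framework, which the text above does not confirm. The depth, filter counts, input shape, optimizer and loss are placeholders, not the configuration of the model in this repository.

```python
import numpy as np


def noise_target(noisy_db, clean_db):
    """Training target: noisy spectrogram minus clean spectrogram."""
    return np.asarray(noisy_db) - np.asarray(clean_db)


def build_small_unet(input_shape=(128, 128, 1)):
    """Two-level U-Net sketch; height and width must be divisible by 4."""
    from tensorflow.keras import layers, Model  # assumed framework, imported lazily

    inputs = layers.Input(shape=input_shape)

    # Encoder: convolution followed by max-pooling at each level.
    enc1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    pool1 = layers.MaxPooling2D(2)(enc1)
    enc2 = layers.Conv2D(32, 3, padding="same", activation="relu")(pool1)
    pool2 = layers.MaxPooling2D(2)(enc2)

    bottleneck = layers.Conv2D(64, 3, padding="same", activation="relu")(pool2)

    # Decoder: up-sampling, then concatenation with the matching encoder output
    # (the skip connection), then convolution.
    up2 = layers.UpSampling2D(2)(bottleneck)
    dec2 = layers.Conv2D(32, 3, padding="same", activation="relu")(
        layers.Concatenate()([up2, enc2])
    )
    up1 = layers.UpSampling2D(2)(dec2)
    dec1 = layers.Conv2D(16, 3, padding="same", activation="relu")(
        layers.Concatenate()([up1, enc1])
    )

    # Linear 1x1 convolution so the output can take any dB value.
    outputs = layers.Conv2D(1, 1, padding="same")(dec1)

    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```

With arrays `noisy_db` and `clean_db` of shape `(batch, 128, 128, 1)`, training would then be a call along the lines of `model.fit(noisy_db, noise_target(noisy_db, clean_db))`.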

Generating clean speech spectrogram and converting those back to .wav files

To generate clean speech spectrograms, we subtract the noise spectrograms predicted by the model from the noisy speech spectrograms given as input. To convert a log-spectrogram back to an audio file, we first convert decibels back to linear magnitude and rebuild the complex matrix by multiplying the magnitude spectrogram by its corresponding phase spectrogram, then use librosa.core.istft to generate time series data, which can be written to a .wav file using soundfile.write.
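The reconstruction can be sketched as follows. It assumes the log-spectrograms are magnitudes in decibels (20 * log10), that the phase of the noisy input is reused, and that `hop_length` matches the value used for the forward STFT. The helper names and default values are hypothetical.

```python
import numpy as np


def rebuild_complex_stft(noisy_db, predicted_noise_db, phase):
    """Subtract the predicted noise (in dB) and re-attach the phase."""
    clean_db = np.asarray(noisy_db) - np.asarray(predicted_noise_db)
    # Undo the decibel conversion: |S| = 10^(dB / 20).
    magnitude = 10.0 ** (clean_db / 20.0)
    # Magnitude times exp(j * phase) gives back a complex spectrogram.
    return magnitude * np.exp(1j * np.asarray(phase))


def write_denoised_wav(path, noisy_db, predicted_noise_db, phase,
                       sample_rate=8000, hop_length=64):
    """Invert the STFT and save the denoised waveform as a .wav file."""
    import librosa  # imported lazily; hop_length must match the forward STFT
    import soundfile as sf

    stft_matrix = rebuild_complex_stft(noisy_db, predicted_noise_db, phase)
    waveform = librosa.istft(stft_matrix, hop_length=hop_length)
    sf.write(path, waveform, sample_rate)
    return waveform
```

Reusing the noisy phase is a common shortcut: only the magnitude is cleaned, so some residual distortion from the original phase can remain.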

Input                   Output
Input audio file - 1    Output audio file - 1
Input audio file - 2    Output audio file - 2

Calculate metrics to evaluate performance of the model

Finally, to evaluate the performance of our model we use the WER (word error rate), MER (match error rate) and WIL (word information lost) metrics. Thanks to my friend Simon, who has prepared a notebook for this (Link).
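For reference, the three metrics can be computed from a word-level alignment between a reference transcript and a speech recognizer's transcript of the audio. The sketch below is a plain-Python illustration using the standard definitions; it is not the code from the linked notebook, and the function names are hypothetical.

```python
def align_counts(reference, hypothesis):
    """Return (hits, substitutions, deletions, insertions) from a minimum edit-distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)

    # dist[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)

    # Walk back through the table and count each kind of operation.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            hits, i, j = hits + 1, i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs, i, j = subs + 1, i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels, i = dels + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return hits, subs, dels, ins


def speech_metrics(reference, hypothesis):
    """Compute WER, MER and WIL for one reference/hypothesis pair."""
    hits, subs, dels, ins = align_counts(reference, hypothesis)
    n_ref = hits + subs + dels   # words in the reference
    n_hyp = hits + subs + ins    # words in the hypothesis
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    errors = subs + dels + ins
    wer = errors / n_ref
    mer = errors / (hits + errors)
    wil = 1.0 - (hits * hits) / (n_ref * n_hyp) if n_hyp else 1.0
    return {"wer": wer, "mer": mer, "wil": wil}
```

Comparing these scores for transcripts of the noisy input and of the denoised output indicates whether enhancement made the speech easier to recognize; lower is better for all three.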

How to run

There are 2 options to denoise data:

  1. Retrain the whole network on new data.
  2. Use the pre-trained model from the Models folder to get the denoised output.

References

  1. Research papers for speech-enhancement Link
  2. Speech Enhancement In Multiple-Noise Conditions using Deep Neural Networks by Anurag Kumar, Dinei Florencio. Link
  3. A Fully Convolutional Neural Network for Speech Enhancement by Se Rim Park and Jin Won Lee. Link
  4. Speech Denoising DNN Link
  5. Sound of AI YouTube channel Link
  6. Digital signal processing Link
  7. Speech Enhancement Link
  8. Unet Model Link
  9. Why Skip Connections are needed in Unet Link
  10. Understanding Semantic Segmentation with Unet Link
  11. Understanding Up Sampling Link
  12. Understanding Convolutional Neural Networks Link
  13. Padding in CNN Link
  14. Denoising-images Link

Contributors

kaushalnaresh