Giter Club home page Giter Club logo

wave-u-net-for-speech-enhancement's Introduction

Improved Speech Enhancement with the Wave-U-Net

The Wave-U-Net applied to speech enhancement [1], an adaptation of the original implementation for music source separation by Stoller et al [2].

The Wave-U-Net is a convolutional neural network applicable to audio source separation tasks, recently introduced by Stoller et al for the separation of music vocals and accompaniment [2]. A 1D convolutional time domain variant of the 2D convolutions within the U-Net [3], this end-to-end learning method for audio source separation operates directly in the time domain, permitting the integrated modelling of phase information and being able to take large temporal contexts into account.

Experiments on audio source separation for speech enhancement in [1] show that the proposed method rivals state-of-the-art architectures, improving upon various metrics, namely PESQ, CSIG, CBAK, COVL and SSNR, with respect to the single-channel speech enhancement task on the Voice Bank corpus (VCTK) dataset. Future experimentation will focus on increasing effectiveness and efficiency by further adapting the model size and other parameters, e.g. filter sizes, to the task and expanding to multi-channel audio and multi-source-separation.


Architecture

The architecture is the same as that employed in [2] with the exception of the number of hidden layers and the validation set size. The number of hidden layers was experimented with and the results suggest the optimum size to be 9 layers.

See diagram below for a summary visual representation of the architecture:

alt text

Initialisation

Requirements

Under our implementation, training took c.36 hours using GeForce GTX 1080 Ti GPU with 11178 MiB, on Linux Ubuntu 16.04, with Python 2.7. In a new virtual environment, required Python 2.7 packages can be installed using pip install -r requirements.txt. N.B. this presumes the installation of ffmpeg and libsndfile.


Data Preparation

Train and test datasets provided by the 28-speaker Voice Bank Corpus (VCTK) [4] (30 speakers in total - 28 intended for training and 2 reserved for testing). The noisy training data were generated by mixing the clean data with various noise datasets, as per the instructions provided in [4, 5, 6].

The training dataset should then be prepared for being parsed as an XML file (not provided) using the ElementTree XML API in Datasets.getAudioData.


Training

Training can be executed by running the command python Training.py, modyfing the parameters in Config.py as desired.

Testing

Testing experiments can be performed by running the command python Test_Predictions_VCTK.py.

Speech source estimates should then be evaluated against the clean speech file they are estimating. This can be done using Evaluate.m, which selects multiple files and performs the composite.m script [7] (available here) upon each one, calculating the PESQ, SSNR, CSIG, CBAK and COVL.

Audio examples of both speech and background noise estimates of the VCTK test set, alongside the noisy test files and clean speech for reference, are available for download in the audio_examples directory.


References

[1] Craig Macartney and Tillman Weyde. Improved Speech Enhancement with the Wave-U-Net. 2018. URL http://arxiv.org/abs/1811.11307

[2] Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. 6 2018. URL https://arxiv.org/abs/1806.03185.

[3] Andreas Jansson, Eric J. Humphrey, Nicola Montecchio, Rachel M. Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep u-net convolutional networks. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017, pages 745–751, 2017. URL https://ismir2017.smcnus.org/wp-content/uploads/2017/10/171_Paper.pdf.

[4] Cassia Valentini-Botinhao. Noisy speech database for training speech enhancement algorithms and TTS models, 2016 [sound]. University of Edinburgh. School of Informatics. Centre for Speech Technology Research (CSTR), 2017. URL http://dx.doi.org/10.7488/ds/2117.

[5] Santiago Pascual, Antonio Bonafonte, and Joan Serrà. SEGAN: Speech Enhancement Generative Adversarial Network. doi: 10.7488/ds/1356. URL http://dx.doi.org/10.7488/ds/1356.

[6] Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. Technical report. URL https://www.research.ed.ac.uk/portal/files/26581510/SSW9_Cassia_1.pdf.

[7] Philipos C Loizou. Speech Enhancement: Theory and Practice. CRC Press, Inc., Boca Raton, FL, USA, 2nd edition, 2013. ISBN 1466504218, 9781466504219.

wave-u-net-for-speech-enhancement's People

Contributors

craigmacartney avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.