drethage / speech-denoising-wavenet


A neural network for end-to-end speech denoising

License: MIT License

Python 100.00%

machine-learning deep-learning neural-networks speech-denoising speech wavenet end-to-end speech-processing

speech-denoising-wavenet's Issues

ValueError: output of generator should be a tuple (x, y, sample_weight) or (x, y). Found: None

Getting two errors while training the Wavenet model on Google Colab cloud GPU.
1.
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 409, in data_generator_task
generator_output = next(generator)
File "/content/drive/app/speech-denoising-wavenet/datasets.py", line 125, in get_random_batch_generator
noise = noisy - speech
ValueError: operands could not be broadcast together with shapes (65302,) (60727,)

File "main.py", line 169, in <module>
main()
File "main.py", line 163, in main
training(config, cla)
File "main.py", line 81, in training
config['training']['num_epochs'])
File "/content/drive/app/speech-denoising-wavenet/models.py", line 167, in fit_model
initial_epoch=self.epoch_num)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1481, in fit_generator
str(generator_output))
"ValueError: output of generator should be a tuple (x, y, sample_weight) or (x, y). Found: None".

Any idea what the issue is and how to resolve it? @drethage
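The first traceback suggests that a noisy/clean pair has different lengths (65302 vs. 60727 samples), so `noise = noisy - speech` fails inside the generator thread; the "Found: None" error in the main thread then looks like a consequence of the generator having died. A minimal sketch for locating such pairs before training, using only the standard library. It assumes that noisy and clean files share the same filename and are PCM .wav files; `frame_count` and `find_length_mismatches` are hypothetical helper names, not part of the repository.

```python
import os
import wave


def frame_count(path):
    """Return the number of audio frames in a PCM .wav file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes()


def find_length_mismatches(noisy_dir, clean_dir):
    """List (filename, noisy_frames, clean_frames) for pairs whose lengths differ.

    Assumption: a noisy file and its clean counterpart have the same filename.
    """
    mismatches = []
    for name in sorted(os.listdir(noisy_dir)):
        if not name.lower().endswith(".wav"):
            continue
        clean_path = os.path.join(clean_dir, name)
        if not os.path.exists(clean_path):
            continue  # unpaired file; handle separately if needed
        noisy_frames = frame_count(os.path.join(noisy_dir, name))
        clean_frames = frame_count(clean_path)
        if noisy_frames != clean_frames:
            mismatches.append((name, noisy_frames, clean_frames))
    return mismatches
```

Any pair this reports would need to be re-exported or trimmed to a common length before it can be subtracted sample by sample.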

Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

Running main.py with arguments !THEANO_FLAGS=optimizer=fast_compile,device=gpu python /content/speech-denoising-wavenet/main.py --mode inference --config /content/speech-denoising-wavenet/sessions/001/config.json --noisy_input_path /content/speech-denoising-wavenet/data/NSDTSEA/noisy_testset_wav --clean_input_path /content/speech-denoising-wavenet/data/NSDTSEA/clean_testset_wav

Using TensorFlow backend.
/usr/lib/python2.7/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
WARNING: Logging before flag parsing goes to stderr.
W0724 03:39:50.798191 139764758083456 deprecation_wrapper.py:119] From /usr/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py:310: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0724 03:39:50.807167 139764758083456 deprecation.py:506] From /usr/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py:619: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0724 03:39:53.131911 139764758083456 deprecation.py:506] From /usr/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py:480: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Loading model from epoch: 144
W0724 03:39:53.568438 139764758083456 deprecation_wrapper.py:119] From /usr/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py:106: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

W0724 03:39:53.568799 139764758083456 deprecation_wrapper.py:119] From /usr/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py:111: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0724 03:39:53.569037 139764758083456 deprecation_wrapper.py:119] From /usr/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py:116: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-07-24 03:39:53.569346: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-24 03:39:53.574141: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-07-24 03:39:53.574383: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x28812c0 executing computations on platform Host. Devices:
2019-07-24 03:39:53.574417: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
W0724 03:39:53.574882 139764758083456 deprecation_wrapper.py:119] From /usr/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py:258: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2019-07-24 03:39:53.858215: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
W0724 03:39:54.107969 139764758083456 deprecation_wrapper.py:119] From /usr/lib/python2.7/site-packages/keras/optimizers.py:610: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

Performing inference..

The script then stops.

Method to evaluate enhanced speech

Could you provide some additional information or an evaluation script for the enhanced speech (to obtain SIG, BAK, and OVL)?

Out of memory?

Denoising: p257_156.wav
0%| | 0/1 [00:00<?, ?it/s]terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

cudnn version

Thanks for your repo. Can you provide the cuDNN, cudatoolkit, and tensorflow_gpu versions you used? I am confused about which versions are needed.

Is there any possibility of converting this into a TensorFlow model?

We need to port it to Android, but there is no interface for reading from the checkpoints.

requirements doesn't include TF

requirements.txt doesn't specify the TensorFlow version needed.

Does the Wavenet model support any audio format other than .wav (such as .mp3)?

@drethage Can you clarify whether the training/test data needs to be in .wav format, or whether other formats are supported? I have audio data in .mp3 format, and converting it to .wav is resource-intensive. How can I train and test the Wavenet model directly on .mp3 data?
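Nothing on this page indicates that formats other than .wav are read, so converting up front is the conservative route. A minimal sketch that shells out to the external `ffmpeg` tool (an assumption: it must be installed separately and is not part of this repository). The 16000 Hz mono target matches the `sample_rate` value in the config posted elsewhere on this page; the function names are hypothetical.

```python
import subprocess


def build_ffmpeg_command(mp3_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command that converts one file to mono PCM .wav.

    -y  overwrite the output file if it exists
    -i  input file
    -ar output sample rate in Hz
    -ac number of output channels
    """
    return ["ffmpeg", "-y", "-i", mp3_path,
            "-ar", str(sample_rate), "-ac", "1", wav_path]


def convert_mp3_to_wav(mp3_path, wav_path, sample_rate=16000):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.check_call(build_ffmpeg_command(mp3_path, wav_path, sample_rate))
```

Converting once and caching the .wav files means the cost is paid a single time rather than on every epoch.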

No optparse in python3

Dear,

Under Python 3 I get "No module named optparse". I found this note in the documentation:
"Deprecated since version 2.7: The optparse module is deprecated and will not be developed further; development will continue with the argparse module."
and the link is here,

but could argparse be used instead?
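For what it is worth, `optparse` still ships with the Python 3 standard library, so the import error may have another cause. If a port to `argparse` is wanted anyway, a minimal sketch could look like the following. The flag names are taken from the commands quoted on this page; the defaults, the `training` mode value, and the `build_parser` name are assumptions, not the repository's actual definitions.

```python
import argparse


def build_parser():
    """Build an argparse parser mirroring flags seen in this page's commands."""
    parser = argparse.ArgumentParser(description="Speech denoising (sketch)")
    # 'inference' appears in the quoted commands; 'training' is assumed.
    parser.add_argument("--mode", default="training",
                        choices=["training", "inference"])
    parser.add_argument("--config", default="config.json")
    parser.add_argument("--noisy_input_path", default=None)
    parser.add_argument("--clean_input_path", default=None)
    parser.add_argument("--batch_size", type=int, default=None)
    parser.add_argument("--target_field_length", type=int, default=None)
    return parser
```

The parsed result exposes the same attribute-style access as optparse, for example `build_parser().parse_args().mode`.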

index 2 is out of bounds for axis 0 with size 2

I trained the model according to the usage instructions and downloaded the given dataset, but the following error occurs during training. Can anyone help me?
Using Theano backend.
E:\anaconda\lib\site-packages\theano\gpuarray\dnn.py:184: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to a version >= v5 and <= v7.
warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 7401 on context None
Mapped name None to device cuda: GeForce GTX 1060 (0000:01:00.0)
Traceback (most recent call last):
File "main.py", line 169, in <module>
main()
File "main.py", line 165, in main
inference(config, cla)
File "main.py", line 108, in inference
load_checkpoint=cla.load_checkpoint, print_model_summary=cla.print_model_summary)
File "E:\speech-denoising-wavenet-master\models.py", line 67, in __init__
self.model = self.setup_model(load_checkpoint, print_model_summary)
File "E:\speech-denoising-wavenet-master\models.py", line 76, in setup_model
model = self.build_model()
File "E:\speech-denoising-wavenet-master\models.py", line 220, in build_model
name='data_input_target_field_length')(data_expanded)
File "E:\anaconda\lib\site-packages\keras\engine\base_layer.py", line 457, in __call__
output = self.call(inputs, **kwargs)
File "E:\speech-denoising-wavenet-master\layers.py", line 47, in call
x = keras.backend.permute_dimensions(x, [0, 2, 1])
File "E:\anaconda\lib\site-packages\keras\backend\theano_backend.py", line 936, in permute_dimensions
y._keras_shape = tuple(np.asarray(x._keras_shape)[list(pattern)])
IndexError: index 2 is out of bounds for axis 0 with size 2
@drethage

sor with shape[655360,31,1,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu

Denoised audio has 0db?

I do not know why. All the audio after denoising is almost silent.
{
    "dataset": {
        "extract_voice": true,
        "in_memory_percentage": 1,
        "noise_only_percent": 0.02,
        "num_condition_classes": 29,
        "path": "data/ShowerNoise/",
        "regain": 0.06,
        "sample_rate": 16000,
        "type": "nsdtsea"
    },
    "model": {
        "condition_encoding": "binary",
        "dilations": 7,
        "filters": {
            "lengths": {
                "res": 3,
                "final": [3, 3],
                "skip": 1
            },
            "depths": {
                "res": 128,
                "skip": 128,
                "final": [2048, 256]
            }
        },
        "num_stacks": 3,
        "target_field_length": 1601,
        "target_padding": 1
    },
    "optimizer": {
        "decay": 0.0,
        "epsilon": 1e-08,
        "lr": 0.001,
        "momentum": 0.9,
        "type": "adam"
    },
    "training": {
        "batch_size": 4,
        "early_stopping_patience": 16,
        "loss": {
            "out_1": {
                "l1": 1,
                "l2": 0,
                "weight": 1
            },
            "out_2": {
                "l1": 1,
                "l2": 0,
                "weight": 1
            }
        },
        "num_epochs": 15,
        "num_test_samples": 50,
        "num_train_samples": 450,
        "path": "sessions/ShowerNoise",
        "verbosity": 1
    }
}

I have 500 audio files for training. Of these, 100 are clean audio and 7 are noise-only (with silence as the target output). I do not know why the denoised output is silent.
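When output "sounds silent", it can help to measure the level rather than judge by ear. A minimal sketch that reports the RMS level of a file in dB relative to full scale, using only the standard library. It assumes 16-bit PCM .wav output, which is an assumption about the files written by inference; `rms_dbfs` is a hypothetical helper name.

```python
import math
import struct
import wave


def rms_dbfs(path):
    """Return the RMS level of a 16-bit PCM .wav file in dBFS.

    Returns -inf for an empty or all-zero file.
    """
    with wave.open(path, "rb") as wav_file:
        if wav_file.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav_file.readframes(wav_file.getnframes())
    count = len(raw) // 2
    if count == 0:
        return float("-inf")
    # Little-endian signed 16-bit samples (all channels interleaved).
    samples = struct.unpack("<%dh" % count, raw)
    mean_square = sum(s * s for s in samples) / float(count)
    if mean_square == 0:
        return float("-inf")
    return 10.0 * math.log10(mean_square / (32768.0 ** 2))
```

Comparing this value for the noisy input and the denoised output shows whether the model emits exact zeros or merely a very quiet signal, which point to different causes.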

PESQ and STOI

Can you share your PESQ and STOI scores? I evaluated the test set, but the results are poor: the logMMSE method gives PESQ 2.837 and STOI 0.89, while the WaveNet-denoised output gives PESQ 2.34 and STOI 0.92. How can I improve this?

Readme requirements

The README requirements don't specify the Python version.

Difference Between this and others

So apparently https://github.com/francoisgermain/SpeechDenoisingWithDeepFeatureLosses is picking up some steam... what are the differences between the two systems?

Any way to prevent early stopping?

Hi,
I'm using a P100 GPU and tried to retrain the model (the pretrained one is too large), but training always stops early after a small number of epochs. Is there any way to run more epochs?
I'm using the default config, modified to use less memory by reducing "dilations" to 5.

Epoch 34/250
1000/1000 [==============================] - 66s - loss: 0.0765 - data_output_1_loss: 0.0382 - data_output_2_loss: 0.0382 - data_output_1_mean_absolute_error: 0.0382 - data_output_1_valid_mean_absolute_error: 0.0382 - data_output_2_mean_absolute_error: 0.0382 - data_output_2_valid_mean_absolute_error: 0.0382 - val_loss: 0.0749 - val_data_output_1_loss: 0.0374 - val_data_output_2_loss: 0.0374 - val_data_output_1_mean_absolute_error: 0.0374 - val_data_output_1_valid_mean_absolute_error: 0.0374 - val_data_output_2_mean_absolute_error: 0.0374 - val_data_output_2_valid_mean_absolute_error: 0.0374
Epoch 00033: early stopping
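A config posted elsewhere on this page contains a `training.early_stopping_patience` key (set to 16 there). Assuming training reads that key to decide when to stop, raising it is one way to allow more epochs. A minimal sketch that edits the JSON config in place; `set_early_stopping_patience` is a hypothetical helper, and the effect on training is an assumption rather than confirmed behaviour.

```python
import json


def set_early_stopping_patience(config_path, patience):
    """Rewrite config.json with a new training.early_stopping_patience value."""
    with open(config_path) as f:
        config = json.load(f)
    # Key names follow the config posted on this page.
    config["training"]["early_stopping_patience"] = patience
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4, sort_keys=True)
    return config
```

Setting the patience to at least the configured number of epochs should, under the same assumption, keep early stopping from triggering at all.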

About get_dataset(config, model) function

def get_dataset(config, model):

    if config['dataset']['type'] == 'vctk+demand':
        return datasets.VCTKAndDEMANDDataset(config, model).load_dataset()
    elif config['dataset']['type'] == 'nsdtsea':
        return datasets.NSDTSEADataset(config, model).load_dataset()

I cannot find the VCTKAndDEMANDDataset class.
Thanks.

Denoised audio 0db on NSDTSEA

The training process, started according to readme.md, terminated early at epoch 34 with a loss of 0.034. I assumed the trained model was good enough to denoise, but the output is 0 dB audio containing nothing but silence. I do not know whether this is related to the early stop, but the very low loss makes it confusing.

speed up training

I like this framework. I tested both the training and denoising phases and they work. Are there ways to speed up the training phase? Aside from using a GPU, what other tricks or parameter settings can I use to speed up training? Thanks a lot for your time and research!

Implement the code

Why do we need to input clean sound? Isn't it possible to output a clean sound file just by inputting noisy sound?

Why does the trained model need clean audio files for inferencing on the same noisy audio files?

Ideally, during inference the trained model should take the noisy audio data and output the denoised version of the same files. I have some noisy audio files on which I want to run the pretrained model and get clean files. But during inference it also asks for a path to clean audio data, which I don't have, and the model doesn't run without it.
Why is a previously trained model asking for clean audio files? @drethage

Model just stops when denoising

Every time I run the model with the regular inputs from the README, I get this:
Denoising: p232_001.wav
0%| | 0/1 [00:00<?, ?it/s]
(wavenet) C:\Users\JSai\Documents\Cochlear_Implants\speech-denoising-wavenet>

And it terminates.
Any help would be great.

got wrong array shape

We trained using our own test data, but the following error is reported:

ValueError: Error when checking : expected condition_input to have shape (None, 1) but got array with shape (1, 5)

I wonder what the cause might be, how we can fix it, and how the designated target wav files can be created. Note that we have only clean wav files at a 16000 Hz sample rate.

Thanks

Where is the pre-trained model?

Hi! I couldn't find the pre-trained model. Isn't it supposed to be in the folder sessions/001/models?

Nvidia driver version exception

I am trying to run this model on my files. I have followed the instructions in the documentation. When I run this command:

THEANO_FLAGS=optimizer=fast_compile,device=gpu KERAS_BACKEND=theano python main.py --mode inference --target_field_length 16001 --batch_size 4 --config sessions/001/config.json --noisy_input_path ../noisy

I get an exception,

Exception: The nvidia driver version installed with this OS does not give good results for reduction.

I am using CUDA 9.0 and have installed everything listed in requirements.txt. Has anyone faced a similar issue?

Getting Out-of-memory (OOM) error on running the model on GPU.

On running the same inference command given in the readme.md, I am getting the following OOM error. I am running it on Intel Core i5 7th gen CPU with 8GB RAM and NVidia 940MX 4GB GPU, Keras 1.2 and Theano 0.9.0.

THEANO_FLAGS=optimizer=fast_compile,device=gpu python main.py --mode inference --config sessions/001/config.json --noisy_input_path data/NSDTSEA/noisy_testset_wav --clean_input_path data/NSDTSEA/clean_testset_wav

Using TensorFlow backend.
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Loading model from epoch: 144
2018-02-18 17:40:19.280369: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-02-18 17:40:19.486539: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-18 17:40:19.486944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.2415
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.67GiB
2018-02-18 17:40:19.486961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0)
Performing inference..
Denoising: p232_001.wav
0%| | 0/2 [00:00<?, ?it/s]
2018-02-18 17:40:23.358141: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.01GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-02-18 17:40:33.358618: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 696.00MiB. Current allocation summary follows.
.
.
Stats:
Limit: 3605921792
InUse: 3542674176
MaxInUse: 3542674176
NumAllocs: 973
MaxAllocSize: 464153344
.
.
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[192,128,2770,1]
Allocator (GPU_0_bfc) ran out of memory trying to allocate 389.18MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.

@drethage How can I solve this error? If you need any more information, please let me know.
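To see why a 4 GB card struggles here, it helps to work out how much memory a single tensor from the log needs. A minimal sketch, assuming 4 bytes per element (32-bit float); `tensor_bytes` and `to_mib` are hypothetical helper names.

```python
from functools import reduce
import operator


def tensor_bytes(shape, bytes_per_element=4):
    """Return the memory needed for a dense tensor of the given shape.

    bytes_per_element=4 assumes 32-bit floats.
    """
    return reduce(operator.mul, shape, 1) * bytes_per_element


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / float(2 ** 20)
```

For the shape in the error above, `to_mib(tensor_bytes((192, 128, 2770, 1)))` comes to roughly 259.7 MiB for one tensor, on a device that already has about 3.3 GiB in use. The shape quoted in the truncated issue title earlier on this page, [655360,31,1,256], works out to 19.375 GiB. Because the first dimension of such tensors typically scales with the batch, reducing the `--batch_size` value passed on the command line (a flag that appears in another command on this page) is one thing to try, though whether it is enough depends on the model.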

type error

The system gives an error saying that the data_expanded parameter is of float type but should be int. When I switch the type to int, it gives another error saying that data_expanded is of int type but should be float. I'm confused!

Bad denoising results on the provided model

While the provided model in sessions/001 works well on NSDTSEA test files, the results on my own noisy files (recorded in real conditions) are much worse.

What I did:

Cloned the entire project
Put my files into the 'test' subdir of the project dir
Ran the command
THEANO_FLAGS=optimizer=fast_compile,device=gpu python main.py --mode inference --config sessions/001/config.json --noisy_input_path test
Took the results from sessions/001/samples/samples_1

A large part of the speech was suppressed, although the SNR is not very low.
Am I doing something wrong?

To try it yourself: https://drive.google.com/open?id=1njlPLNjbTuY1QlW_19y06a1ywuImBUHo

parameter size

I am using Python 3 and ran into some problems while debugging this code.

I would like to know the size of the model. Could you tell me the number of parameters?

Thank you !

Qiquan Zhang

Meaning of "in_memory_percentage" config parameter

The config file documentation mentions that:
"...in_memory_percentage: (float) Percentage of the dataset to load into memory, useful when dataset requires more memory than available..."

Does it mean that if I set the value to "1.00", the training toolchain will try to load 100% of the recordings into memory?
And if I set it to "0.10", will the training toolchain load only 10% of the recordings into memory at first and eventually load the other 90% during subsequent iterations, or will it use just 10% in total and ignore the other 90%?

Trying to reproduce the results/ Error with pygpu

Hi,

I am trying to reproduce the results from this repository, but I am having problems running Theano. Can you tell me which NVIDIA driver, CUDA, and cuDNN versions you used? Also, which pygpu version was used to run this project?

Thanks,
Akhil

Can't find model

How to predict not a single value but a distribution as in original Wavenet?

I would like to predict a distribution rather than a single value per sample, as the original Wavenet does. What should I change in the source code? (I have seen some preparation for this in util.py, where sound is converted from linear to ulaw and back...) Help, please.

Has anyone done inference (denoised an audio file) successfully?

I have been digging into the code but haven't been able to make it work. All I want to achieve is to denoise files.

What workflow did you follow to achieve it?

Some samples of the denoised speech wave are missing at the end of the speech

Hi, I used this code and the fine-tuned model directly, but I ran into a problem: the enhanced output wave file is not as long as the original noisy wave. Some samples seem to be missing at the end of the input noisy wave. I can't figure out the reason.
Thanks very much.

Value error

ValueError: output of generator should be a tuple (x, y, sample_weight) or (x, y). Found: None
I was trying to train with my new set of data.

TypeError: 'float' object cannot be interpreted as an integer

Windows 10, python 3.6.3

Full error when trying to run main.py:

 Using TensorFlow backend.
 Traceback (most recent call last):
   File "\speech-denoising-wavenet-master\main.py", line 169, in <module>
     main()
   File "speech-denoising-wavenet-master\main.py", line 163, in main
     training(config, cla)
   File "\speech-denoising-wavenet-master\main.py", line 72, in training
     model = models.DenoisingWavenet(config, load_checkpoint=cla.load_checkpoint, print_model_summary=cla.print_model_summary)
   File "speech-denoising-wavenet-master\models.py", line 50, in __init__
     self.samples_of_interest_indices = self.get_padded_target_field_indices()
   File "speech-denoising-wavenet-master\models.py", line 184, in get_padded_target_field_indices
     target_sample_index + self.half_target_field_length + self.target_padding + 1)
 TypeError: 'float' object cannot be interpreted as an integer
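This error is typical of code written for Python 2 running under Python 3, where `/` between two integers returns a float and `range()` rejects it. A minimal sketch of the likely fix, assuming `half_target_field_length` is derived from the target field length with `/`. The function below is a hypothetical stand-in, not the repository's code: only the upper bound mirrors the traceback line, and the lower bound is a guess made for illustration.

```python
def padded_target_field_indices(target_field_length, target_padding,
                                target_sample_index):
    """Return the index range around a target sample, using integer division.

    Under Python 3, 1601 / 2 is 800.5 (a float), while 1601 // 2 is 800.
    """
    half_target_field_length = target_field_length // 2
    # Lower bound is assumed; upper bound follows the traceback expression.
    return range(
        target_sample_index - half_target_field_length - target_padding,
        target_sample_index + half_target_field_length + target_padding + 1)
```

Replacing `/` with `//` wherever an index or length is computed keeps the behaviour identical under Python 2 and avoids the float under Python 3.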

Issue on using the trained model

Hi,
Initially I could not train the model with a batch size of 10 because of memory limits. I then trained with a batch size of 5 (by editing the config.json file) without touching any other parameters. Training stopped after 41 epochs because of the early stopping condition. I used the 41st checkpoint as the model parameters for testing, but the returned denoised wave file is empty (complete silence).
What could be the problem? Does early stopping cause the null model? Does the batch size really matter when testing a model?

How you got the best performing model

Hello,
When I run on the same NSDTSEA dataset without changing any of the config.json parameters, the resulting model has a validation error of 0.013. If you don't mind, could you tell me how you reached a validation error of 0.00144 for the model you showed? Are there any special parameters to tune to achieve that performance?

What kind of HW is needed to run the "best performing model"?

Hi,
I have a setup with an NVIDIA GTX 1080 card with 8GB RAM, but I'm still unable to perform a successful denoising run on the default dataset without getting an out-of-memory error from TensorFlow :(
I'm currently retraining the model with a smaller "dilations" parameter value, but just out of curiosity - what kind of HW are you using to run denoising without OOM error on your side?

No Python 3 version?

Dear,

Could you please update the project to support Python 3?

Thx

Denoised Speech is silence

I followed the instructions in the README and used the NSDTSEA dataset to reproduce the results, but the output speech is 0 dB and entirely silent. Did I miss something that might lead to this result? Also, I could not find the model under sessions/001/, so I used the configuration under sessions/001/ to train a model and used that to denoise, but it still gives me silence. Could you tell me what else I could do to debug?