
conv-tasnet's Introduction

ConvTasNet

A PyTorch implementation of "TasNet: Surpassing Ideal Time-Frequency Masking for Speech Separation" (see the reference below).

Requirements

see requirements.txt

Usage

  • inference
./nnet/separate.py /path/to/checkpoint --input /path/to/mix.scp --gpu 0 > separate.log 2>&1 &
  • evaluate
./nnet/compute_si_snr.py /path/to/ref_spk1.scp,/path/to/ref_spk2.scp /path/to/inf_spk1.scp,/path/to/inf_spk2.scp
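
Both scripts take .scp lists. These appear to be Kaldi-style script files, i.e. one "utterance-id /path/to/wav" pair per line, with the key used to match mixtures against references (this is an assumption based on the naming; check nnet/libs/audio.py for the exact format). For example:

    mix_utt001 /path/to/wav/mix_utt001.wav
    mix_utt002 /path/to/wav/mix_utt002.wav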

Results (on the best configurations in the paper)

| ID | Settings | Causal | Norm/Non-linear | Param | Loss | Si-SDR (dB) |
|----|----------|--------|-----------------|-------|------|-------------|
| 0 | adam/lr:1e-3/wd:1e-5/32-batch/2gpu | N | BN/relu | 8.75M | -17.59/-15.45 | 14.63 |
| 1 | adam/lr:1e-2/wd:1e-5/20-batch/2gpu | N | gLN/relu | - | -16.09/-15.21 | 14.58 |
| 2 | adam/lr:1e-3/wd:1e-5/20-batch/2gpu | N | gLN/relu | - | -17.91/-16.54 | 15.87 |
| 3 | adam/lr:1e-2/wd:1e-5/32-batch/2gpu | N | BN/sigmoid | - | -14.51/-13.40 | 12.62 |
| 4 | adam/lr:1e-2/wd:1e-5/32-batch/2gpu | N | BN/relu | - | -17.20/-15.38 | 14.58 |
| 5 | adam/lr:1e-3/wd:1e-5/20-batch/2gpu | N | gLN/sigmoid | - | -17.20/-16.11 | 15.55 |
| 6 | adam/lr:1e-3/wd:1e-5/32-batch/2gpu | Y | BN/relu | - | -15.25/-12.47 | 11.42 |
| 7 | adam/lr:1e-3/wd:1e-5/24-batch/2gpu | N | cLN/relu | - | -18.72/-16.17 | 15.25 |
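
For reference, both the Loss and Si-SDR columns are on the scale-invariant SNR scale used in the paper (the training objective in the paper is the negative SI-SNR with utterance-level permutation-invariant training). Below is a minimal NumPy sketch of SI-SNR for a single estimate/reference pair; the repo's own implementation lives in ./nnet/compute_si_snr.py and the trainer, and may differ in detail.

    import numpy as np

    def si_snr(estimate, reference, eps=1e-8):
        """Scale-invariant SNR in dB for a single 1-D estimate/reference pair."""
        estimate = estimate - np.mean(estimate)
        reference = reference - np.mean(reference)
        # project the estimate onto the reference to get the scaled target
        s_target = np.dot(estimate, reference) * reference / (np.dot(reference, reference) + eps)
        e_noise = estimate - s_target
        return 10 * np.log10((np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps))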

Reference

Luo Y., Mesgarani N. TasNet: Surpassing Ideal Time-Frequency Masking for Speech Separation. arXiv preprint arXiv:1809.07454, 2018.


conv-tasnet's Issues

GPU Specs

Hi, thanks for the repo! What are your GPU specs? I am not able to fit batches of 32 or 20 on 2 GPUs; I use a Titan X (12 GB) and a Tesla K40 (12 GB). Also, could you share your loss curve?

Specify which split each loss in the "Result" table refers to (README.md)

Hi,
Thanks for sharing your implementation. To understand the performance of your system, could you specify on which sets the "Loss" and "SI-SDR" columns of the "Result" table in your README are computed? Is it train/valid/test? For example, my guess for row 0 of the table would be train_si_sdr = 17.59, valid_si_sdr = 15.45 and test_si_sdr = 14.63; is that right?

Pretrained model

Hey,
Thank you for the implementation!
Can you please share a pre-trained model?

raise KeyError("Missing utterance {}!".format(index)) KeyError: 'Missing utterance clnsp1!'

Traceback (most recent call last):
  File "train.py", line 86, in <module>
    run(args)
  File "train.py", line 50, in run
    trainer.run(train_loader, dev_loader, num_epochs=args.epochs)
  File "/hardmnt/moissan0/home/mnabih/piccadilly0/home/mnabih/PycharmProjects/Co/nnet/libs/trainer.py", line 215, in run
    cv = self.eval(dev_loader)
  File "/hardmnt/moissan0/home/mnabih/piccadilly0/home/mnabih/PycharmProjects/Co/nnet/libs/trainer.py", line 203, in eval
    for egs in data_loader:
  File "/hardmnt/moissan0/home/mnabih/piccadilly0/home/mnabih/PycharmProjects/Co/nnet/libs/dataset.py", line 143, in __iter__
    for chunks in self.eg_loader:
  File "/hardmnt/moissan0/home/mnabih/piccadilly0/home/mnabih/PycharmProjects/Co/lib64/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/hardmnt/moissan0/home/mnabih/piccadilly0/home/mnabih/PycharmProjects/Co/lib64/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/hardmnt/moissan0/home/mnabih/piccadilly0/home/mnabih/PycharmProjects/Co/lib64/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/hardmnt/moissan0/home/mnabih/piccadilly0/home/mnabih/PycharmProjects/Co/lib64/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/hardmnt/moissan0/home/mnabih/piccadilly0/home/mnabih/PycharmProjects/Co/lib64/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/hardmnt/moissan0/home/mnabih/piccadilly0/home/mnabih/PycharmProjects/Co/lib64/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/hardmnt/moissan0/home/mnabih/piccadilly0/home/mnabih/PycharmProjects/Co/lib64/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/hardmnt/moissan0/home/mnabih/piccadilly0/home/mnabih/PycharmProjects/Co/nnet/libs/dataset.py", line 42, in __getitem__
    ref = [reader[key] for reader in self.ref]
  File "/hardmnt/moissan0/home/mnabih/piccadilly0/home/mnabih/PycharmProjects/Co/nnet/libs/dataset.py", line 42, in <listcomp>
    ref = [reader[key] for reader in self.ref]
  File "/hardmnt/moissan0/home/mnabih/piccadilly0/home/mnabih/PycharmProjects/Co/nnet/libs/audio.py", line 119, in __getitem__
    raise KeyError("Missing utterance {}!".format(index))
KeyError: 'Missing utterance clnsp10!'
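
This error usually means that an utterance key listed in one .scp file (e.g. the mixture list) has no matching entry in one of the reference .scp files read by nnet/libs/audio.py. A quick, hypothetical check for mismatched keys (assuming Kaldi-style "key path" lines; the file names below are placeholders):

    def load_keys(scp_path):
        # the first token of each non-empty line is the utterance key
        with open(scp_path) as f:
            return {line.split()[0] for line in f if line.strip()}

    mix_keys = load_keys("mix.scp")
    for ref_scp in ("ref_spk1.scp", "ref_spk2.scp"):
        missing = mix_keys - load_keys(ref_scp)
        if missing:
            print("{} is missing {} keys, e.g. {}".format(
                ref_scp, len(missing), sorted(missing)[:5]))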

How to do multiprocessing in dataloader?

This code really helps. However, when the batch size is large, data loading becomes a bottleneck for training, and there is no multi-worker support in the dataloader. I hope this can be addressed. Any suggestions on this issue?
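
For a standard torch.utils.data.Dataset, parallel loading is enabled simply by passing num_workers to DataLoader; how this maps onto the repo's chunk-based dataset.py is a separate question. A generic, hypothetical sketch (not the repo's code):

    import torch
    from torch.utils.data import DataLoader, Dataset

    class ChunkDataset(Dataset):
        """Hypothetical dataset returning fixed-length waveform chunks."""
        def __init__(self, chunks):
            self.chunks = chunks

        def __len__(self):
            return len(self.chunks)

        def __getitem__(self, idx):
            return self.chunks[idx]

    loader = DataLoader(
        ChunkDataset([torch.randn(32000) for _ in range(64)]),
        batch_size=16,
        shuffle=True,
        num_workers=4,     # 4 worker processes load batches in parallel
        pin_memory=True)   # faster host-to-GPU transfers

    for batch in loader:
        pass  # training step would go here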

Softmax

Hi, I see you have tabulated all results with ReLU or sigmoid as the non-linear layer. Did you try softmax? If yes, how were the results? The architecture in the paper has a softmax layer after the masks are estimated.
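
For context, the three options differ only in the function applied to the mask logits; softmax is taken across the speaker dimension, so the per-bin masks sum to one. A minimal sketch with assumed tensor shapes (not the repo's code):

    import torch

    # hypothetical mask logits: (batch, num_spks, channels, frames)
    logits = torch.randn(4, 2, 256, 100)

    relu_mask = torch.relu(logits)                # non-negative, unbounded
    sigmoid_mask = torch.sigmoid(logits)          # each entry in (0, 1), per speaker independently
    softmax_mask = torch.softmax(logits, dim=1)   # masks sum to 1 across the speaker dimension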

The scale of the output (the predicted speech) is not consistent with the input scale (the input mixture).

Hi, thanks for sharing your code. I have a question about separate.py, specifically about the code below:
def run(args):
    mix_input = WaveReader(args.input, sample_rate=args.fs)
    computer = NnetComputer(args.checkpoint, args.gpu)
    for key, mix_samps in mix_input:
        logger.info("Compute on utterance {}...".format(key))
        spks = computer.compute(mix_samps)
        norm = np.linalg.norm(mix_samps, np.inf)
        for idx, samps in enumerate(spks):
            samps = samps[:mix_samps.size]
            samps = samps * norm / np.max(np.abs(samps))
            write_wav(
                os.path.join(args.dump_dir, "spk{}/{}.wav".format(idx + 1, key)),
                samps,
                fs=args.fs)

I found that the separated speech is not at the same energy level as the mixture, even though the separation quality is really good. I debugged the code and, surprisingly, found that max(mix_samps) is around 1, while max(samps) is around 200 before normalization.

I don't think we should normalize here, since it changes the energy level of the separated speech:
samps = samps * norm / np.max(np.abs(samps))
However, without normalization, as I said, max(samps) is around 200. Do you know why, and how to fix it?
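
If the goal is simply to keep the outputs on the same scale as the mixture, one possible workaround (an option, not the author's recommendation) is to replace the per-source peak normalization with a single gain shared by all sources, so their relative energies are preserved. A hedged sketch reusing the names from the snippet above:

    import numpy as np

    def rescale_to_mixture(spks, mix_samps, eps=1e-8):
        """Apply one common gain to all separated sources so that their sum
        has the same peak level as the input mixture."""
        spks = [s[:mix_samps.size] for s in spks]
        est_mix = np.sum(np.stack(spks), axis=0)
        gain = np.max(np.abs(mix_samps)) / (np.max(np.abs(est_mix)) + eps)
        return [s * gain for s in spks]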
