xflick / eend_pytorch
A PyTorch implementation of End-to-End Neural Diarization
License: MIT License
Hi,
Thanks again for open-sourcing the work and for your response to my earlier pretrained-model issue.
I do have a question: if the chunk size is smaller than the entire input length during inference, so that one file yields more than one chunk, the speaker IDs will not be consistent across chunks because of the permutation, right?
Just to confirm: there is no post-processing when writing out the RTTM, correct?
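Since the model is trained with a permutation-invariant (PIT) loss, each chunk's output columns can indeed come out in a different speaker order. For reference, here is a minimal sketch (not part of this repo; all names are hypothetical) of one heuristic for re-aligning speaker columns between consecutive chunks by matching activity around the chunk boundary; it only helps when the same speakers stay active across the boundary:

import itertools
import numpy as np

def best_permutation(prev_tail, cur_head):
    # Pick the speaker-column permutation of the current chunk that best matches
    # the previous chunk's activity on the frames near the boundary.
    # prev_tail, cur_head: posteriors of shape (n_frames, n_speakers).
    n_spk = prev_tail.shape[1]
    return max(itertools.permutations(range(n_spk)),
               key=lambda p: float(np.sum(prev_tail * cur_head[:, list(p)])))

# usage sketch: `chunks` is a list of (T, n_speakers) posterior arrays
# aligned = [chunks[0]]
# for cur in chunks[1:]:
#     perm = best_permutation(aligned[-1][-10:], cur[:10])
#     aligned.append(cur[:, list(perm)])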
First of all, thank you for open-sourcing this code.
There is no folder named exp_large, and some files in this directory, such as avg.th or transformer.th, are not found. Can you provide these files? Or, if retraining does not require them, how should I pass them in?
Also, can the adapt stage be skipped?
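For context: exp_large is presumably the experiment directory that training itself creates, and avg.th is typically produced by averaging the parameters of the last few epoch checkpoints (as in the original EEND recipe), so these files are generated by running the train/adapt stages rather than shipped with the repo. A hedged sketch of what that averaging step usually looks like, with hypothetical paths and epoch range:

import torch

def average_checkpoints(checkpoint_paths, out_path):
    # assumes each .th file is a plain state_dict of model parameters
    avg = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    for k in avg:
        avg[k] /= len(checkpoint_paths)
    torch.save(avg, out_path)

# hypothetical usage: average the last 10 epoch checkpoints
# average_checkpoints([f"exp_large/models/transformer{e}.th" for e in range(91, 101)],
#                     "exp_large/models/avg.th")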
Sorry, but which LDC catalog numbers do the SWBD Part 1 and 2 data refer to? For Part 1 there are:
LDC2001S13 Switchboard Cellular Part 1 Audio
LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
I don't know whether S15 is needed. Also, I'd like to ask whether you could share Switchboard Phase 3; I could offer Switchboard Phase 1 in exchange.
(Please forgive me if this is presumptuous; it is only because we lack the datasets and my advisor seems unwilling to buy them.)
Hi,
Thank you for implementing and sharing this. I am trying to run inference on my data with your pretrained model, but I am new to PyTorch and it is not clear to me how to set up the correct environment. Could you tell me which packages are needed, please? Thank you.
Hi,
Thanks for open-sourcing this work and providing the pretrained large model.
You mentioned that your version was trained without Switchboard Phase 1, while your colleague trained on the full data. I'm wondering whether the pretrained large model in this repo was trained on the full data or without Switchboard Phase 1.
Thank you for open-sourcing your work.
I'm trying to familiarize myself with the repo and tried inference with the provided pretrained model, but I get an error when loading the model. I would greatly appreciate your help with this issue.
Here are the arguments I pass to infer.py:
"${workspaceFolder}/conf/large/infer.yaml",
"${workspaceFolder}/dataset",
"${workspaceFolder}/pretrained_models/large/model_callhome.th",
"${workspaceFolder}/outpath",
"--model_type",
"Transformer",
"--gpu",
"-0",
And here is the error:
File "/EEND_PyTorch/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerModel:
Missing key(s) in state_dict: "encoder.weight", "encoder.bias", "encoder_norm.weight", "encoder_norm.bias", "transformer_encoder.layers.0.self_attn.in_proj_weight", "transformer_encoder.layers.0.self_attn.in_proj_bias", "transformer_encoder.layers.0.self_attn.out_proj.weight", "transformer_encoder.layers.0.self_attn.out_proj.bias", "transformer_encoder.layers.0.linear1.weight", "transformer_encoder.layers.0.linear1.bias", "transformer_encoder.layers.0.linear2.weight", "transformer_encoder.layers.0.linear2.bias", "transformer_encoder.layers.0.norm1.weight", "transformer_encoder.layers.0.norm1.bias", "transformer_encoder.layers.0.norm2.weight", "transformer_encoder.layers.0.norm2.bias", "transformer_encoder.layers.1.self_attn.in_proj_weight", "transformer_encoder.layers.1.self_attn.in_proj_bias", "transformer_encoder.layers.1.self_attn.out_proj.weight", "transformer_encoder.layers.1.self_attn.out_proj.bias", "transformer_encoder.layers.1.linear1.weight", "transformer_encoder.layers.1.linear1.bias", "transformer_encoder.layers.1.linear2.weight", "transformer_encoder.layers.1.linear2.bias", "transformer_encoder.layers.1.norm1.weight", "transformer_encoder.layers.1.norm1.bias", "transformer_encoder.layers.1.norm2.weight", "transformer_encoder.layers.1.norm2.bias", "transformer_encoder.layers.2.self_attn.in_proj_weight", "transformer_encoder.layers.2.self_attn.in_proj_bias", "transformer_encoder.layers.2.self_attn.out_proj.weight", "transformer_encoder.layers.2.self_attn.out_proj.bias", "transformer_encoder.layers.2.linear1.weight", "transformer_encoder.layers.2.linear1.bias", "transformer_encoder.layers.2.linear2.weight", "transformer_encoder.layers.2.linear2.bias", "transformer_encoder.layers.2.norm1.weight", "transformer_encoder.layers.2.norm1.bias", "transformer_encoder.layers.2.norm2.weight", "transformer_encoder.layers.2.norm2.bias", "transformer_encoder.layers.3.self_attn.in_proj_weight", "transformer_encoder.layers.3.self_attn.in_proj_bias", "transformer_encoder.layers.3.self_attn.out_proj.weight", "transformer_encoder.layers.3.self_attn.out_proj.bias", "transformer_encoder.layers.3.linear1.weight", "transformer_encoder.layers.3.linear1.bias", "transformer_encoder.layers.3.linear2.weight", "transformer_encoder.layers.3.linear2.bias", "transformer_encoder.layers.3.norm1.weight", "transformer_encoder.layers.3.norm1.bias", "transformer_encoder.layers.3.norm2.weight", "transformer_encoder.layers.3.norm2.bias", "decoder.weight", "decoder.bias".
Unexpected key(s) in state_dict: "module.encoder.weight", "module.encoder.bias", "module.encoder_norm.weight", "module.encoder_norm.bias", "module.transformer_encoder.layers.0.self_attn.in_proj_weight", "module.transformer_encoder.layers.0.self_attn.in_proj_bias", "module.transformer_encoder.layers.0.self_attn.out_proj.weight", "module.transformer_encoder.layers.0.self_attn.out_proj.bias", "module.transformer_encoder.layers.0.linear1.weight", "module.transformer_encoder.layers.0.linear1.bias", "module.transformer_encoder.layers.0.linear2.weight", "module.transformer_encoder.layers.0.linear2.bias", "module.transformer_encoder.layers.0.norm1.weight", "module.transformer_encoder.layers.0.norm1.bias", "module.transformer_encoder.layers.0.norm2.weight", "module.transformer_encoder.layers.0.norm2.bias", "module.transformer_encoder.layers.1.self_attn.in_proj_weight", "module.transformer_encoder.layers.1.self_attn.in_proj_bias", "module.transformer_encoder.layers.1.self_attn.out_proj.weight", "module.transformer_encoder.layers.1.self_attn.out_proj.bias", "module.transformer_encoder.layers.1.linear1.weight", "module.transformer_encoder.layers.1.linear1.bias", "module.transformer_encoder.layers.1.linear2.weight", "module.transformer_encoder.layers.1.linear2.bias", "module.transformer_encoder.layers.1.norm1.weight", "module.transformer_encoder.layers.1.norm1.bias", "module.transformer_encoder.layers.1.norm2.weight", "module.transformer_encoder.layers.1.norm2.bias", "module.transformer_encoder.layers.2.self_attn.in_proj_weight", "module.transformer_encoder.layers.2.self_attn.in_proj_bias", "module.transformer_encoder.layers.2.self_attn.out_proj.weight", "module.transformer_encoder.layers.2.self_attn.out_proj.bias", "module.transformer_encoder.layers.2.linear1.weight", "module.transformer_encoder.layers.2.linear1.bias", "module.transformer_encoder.layers.2.linear2.weight", "module.transformer_encoder.layers.2.linear2.bias", "module.transformer_encoder.layers.2.norm1.weight", "module.transformer_encoder.layers.2.norm1.bias", "module.transformer_encoder.layers.2.norm2.weight", "module.transformer_encoder.layers.2.norm2.bias", "module.transformer_encoder.layers.3.self_attn.in_proj_weight", "module.transformer_encoder.layers.3.self_attn.in_proj_bias", "module.transformer_encoder.layers.3.self_attn.out_proj.weight", "module.transformer_encoder.layers.3.self_attn.out_proj.bias", "module.transformer_encoder.layers.3.linear1.weight", "module.transformer_encoder.layers.3.linear1.bias", "module.transformer_encoder.layers.3.linear2.weight", "module.transformer_encoder.layers.3.linear2.bias", "module.transformer_encoder.layers.3.norm1.weight", "module.transformer_encoder.layers.3.norm1.bias", "module.transformer_encoder.layers.3.norm2.weight", "module.transformer_encoder.layers.3.norm2.bias", "module.decoder.weight", "module.decoder.bias".
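For reference, the missing and unexpected keys differ only by a module. prefix, which is what nn.DataParallel prepends when a wrapped model is saved. A minimal workaround sketch, assuming the .th file is a plain state_dict and model is the already-constructed TransformerModel, is to strip the prefix before loading:

import torch

state = torch.load("pretrained_models/large/model_callhome.th", map_location="cpu")
# drop the "module." prefix that nn.DataParallel prepends to every parameter name
state = {k[len("module."):] if k.startswith("module.") else k: v for k, v in state.items()}
model.load_state_dict(state)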
Hi,
Thanks again for this amazing work.
I do have a question regarding where src_mask is built in model.py. I think this should be done after src is padded, right? Each sequence in the src input may have a different length, and the input src is actually a list of sequences, so src.size(1) cannot be used to get the sequence length.
Looking forward to your reply. Thanks!
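For what it's worth, here is a minimal sketch (hypothetical names, not this repo's code) of building the padding mask from the per-sequence lengths after pad_sequence, instead of relying on src.size(1) of an unpadded list:

import torch
from torch.nn.utils.rnn import pad_sequence

def pad_and_mask(xs):
    # xs: list of (T_i, D) feature tensors with different lengths.
    # Returns the padded batch (B, T_max, D) and a boolean mask (B, T_max) that is
    # True on padded positions, as expected by the src_key_padding_mask argument
    # of nn.TransformerEncoder.
    lengths = torch.tensor([x.size(0) for x in xs])
    src = pad_sequence(xs, batch_first=True)
    mask = torch.arange(src.size(1))[None, :] >= lengths[:, None]
    return src, mask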
Thank you for your PyTorch implementation of SA-EEND. When I run inference on the callhome2-2spk dataset, I find that I cannot match my results to the result table in README.md. I use 'model_callhome.th' as initialization. The results are below.
Which row of the table in README.md does 'model_callhome.th' correspond to, "PyTorch" or "PyTorch*"? If it corresponds to "PyTorch", then the best DER result is 10.97 rather than 11.21.
I am looking forward to your reply. Thanks!
Thank you for your open-source code. Could you release the pretrained model trained on the Mixer6 + SRE + SWBD dataset? Many people like me cannot access these databases.
Hello there,
first of all, thanks for the implementation of the EEND system in PyTorch. I am trying to run the large model with simulated data, and it turns out that the model occupies more than 20 GB of GPU memory!
Is this normal, or could it be due to some implementation error?
Position: eend/pytorch_backend/models.py
Line 92
Is it "src"?
I set gpu: 4 in train.yaml to train with multiple GPUs.
But I got the error below.
raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
How can I avoid the error?
The full error message is as follows.
Traceback (most recent call last):
File "eend/bin/train.py", line 63, in
train(args)
File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/train.py", line 142, in train
loss, label = batch_pit_loss(output, t)
File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/loss.py", line 58, in batch_pit_loss
losses, labels = zip(*loss_w_labels)
File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/loss.py", line 58, in
losses, labels = zip(*loss_w_labels)
File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/loss.py", line 38, in pit_loss
losses = torch.stack([F.binary_cross_entropy_with_logits(pred[label_delay:, ...], l[:len(l) - label_delay, ...]) for l in label_perms])
File "/data001/bongki.lee/sources/EEND/pytorch_EEND/eend/pytorch_backend/loss.py", line 38, in
losses = torch.stack([F.binary_cross_entropy_with_logits(pred[label_delay:, ...], l[:len(l) - label_delay, ...]) for l in label_perms])
File "/home/VI251703/.conda/envs/pytorch_eend/lib/python3.7/site-packages/torch/nn/functional.py", line 2827, in binary_cross_entropy_with_logits
raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([500, 2])) must be the same as input size (torch.Size([125, 2]))
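One possible explanation, offered only as an assumption since I have not traced this repo's train.py: nn.DataParallel scatters and gathers tensors along dim 0, and 500 / 4 GPUs = 125, so it looks as if the axis that got split across the 4 GPUs was the 500-frame time axis rather than the batch axis, while the labels used in batch_pit_loss on the host still have all 500 frames. A tiny probe showing the dim-0 splitting behaviour:

import torch
import torch.nn as nn

class Probe(nn.Module):
    def forward(self, x):
        # each replica prints the shape of the slice it received
        print("replica input shape:", tuple(x.shape))
        return x

if torch.cuda.device_count() > 1:
    probe = nn.DataParallel(Probe()).cuda()
    x = torch.zeros(500, 2).cuda()  # first axis is time here, not batch
    _ = probe(x)                    # dim 0 is split, so with 4 GPUs each replica sees (125, 2)

If that is what is happening, training with gpu: 1, or making sure the tensor handed to the DataParallel-wrapped model has the batch axis as dim 0, should avoid the mismatch.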
Hi,
In End-to-End Neural Speaker Diarization with Self-Attention, Fig. 2, LayerNorm is applied after the encoder blocks, but in your implementation the order is reversed. Are there any particular reasons for that?
Have a good day.
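For context, the two orderings being contrasted can be sketched as follows (illustrative only, not the repo's actual code; the sizes follow the 256-dim, 4-head, 4-layer config shown elsewhere in this thread):

import torch.nn as nn

d_model, n_heads, n_layers = 256, 4, 4
layer = nn.TransformerEncoderLayer(d_model, n_heads, dropout=0.1)

# Paper (Fig. 2): LayerNorm applied after the stack of encoder blocks;
# the norm= argument of nn.TransformerEncoder does exactly this.
norm_after = nn.TransformerEncoder(layer, n_layers, norm=nn.LayerNorm(d_model))

# Reversed order: LayerNorm applied before the encoder blocks.
norm_before = nn.Sequential(nn.LayerNorm(d_model), nn.TransformerEncoder(layer, n_layers))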
How can I experiment on CALLHOME dataset?
I used your pretrained model and then trained on the AMI train dataset, then tested the model on the AMI test set. The adaptation config is listed below:
sampling_rate: 8000
frame_size: 200
frame_shift: 80
model_type: Transformer
max_epochs: 100
gradclip: 5
batchsize: 64
hidden_size: 256
num_frames: 500
num_speakers: 5
input_transform: logmel23_mn
optimizer: adam
lr: 1e-5
context_size: 7
subsampling: 10
gradient_accumulation_steps: 1
transformer_encoder_n_heads: 4
transformer_encoder_n_layers: 4
transformer_encoder_dropout: 0.1
noam_warmup_steps: 100000
seed: 777
gpu: 1
During training and testing, the AMI audio is downsampled to 8 kHz, and num_speakers is changed from 2 to 5 (the number of speakers in the AMI dataset is between 2 and 5).
The final training loss is 0.09410, and the best DER on the dev set is 28.64. The best test DER is 78.72. It seems that the model cannot distinguish different speakers.
SPEAKER EN2002a 1 0.00 260.00 EN2002a_0
SPEAKER EN2002a 1 260.06 0.06 EN2002a_0
SPEAKER EN2002a 1 260.15 0.02 EN2002a_0
SPEAKER EN2002a 1 260.24 0.01 EN2002a_0
SPEAKER EN2002a 1 260.68 0.01 EN2002a_0
SPEAKER EN2002a 1 261.21 0.04 EN2002a_0
SPEAKER EN2002a 1 261.28 0.12 EN2002a_0
SPEAKER EN2002a 1 261.41 0.06 EN2002a_0
SPEAKER EN2002a 1 261.76 0.09 EN2002a_0
SPEAKER EN2002a 1 261.86 0.08 EN2002a_0
SPEAKER EN2002a 1 264.35 0.01 EN2002a_0
SPEAKER EN2002a 1 264.41 0.03 EN2002a_0
SPEAKER EN2002a 1 264.45 0.03 EN2002a_0
SPEAKER EN2002a 1 264.66 0.01 EN2002a_0
SPEAKER EN2002a 1 264.85 0.01 EN2002a_0
SPEAKER EN2002a 1 264.87 0.01 EN2002a_0
SPEAKER EN2002a 1 264.89 0.04 EN2002a_0
SPEAKER EN2002a 1 264.94 0.01 EN2002a_0
SPEAKER EN2002a 1 265.34 0.08 EN2002a_0
SPEAKER EN2002a 1 265.43 0.07 EN2002a_0
SPEAKER EN2002a 1 266.27 0.01 EN2002a_0
SPEAKER EN2002a 1 266.30 0.01 EN2002a_0
SPEAKER EN2002a 1 266.34 0.03 EN2002a_0
SPEAKER EN2002a 1 266.73 0.01 EN2002a_0
SPEAKER EN2002a 1 266.79 0.03 EN2002a_0
SPEAKER EN2002a 1 266.83 0.01 EN2002a_0
SPEAKER EN2002a 1 267.22 0.01 EN2002a_0
SPEAKER EN2002a 1 267.68 0.02 EN2002a_0
SPEAKER EN2002a 1 267.78 0.01 EN2002a_0
SPEAKER EN2002a 1 267.80 0.02 EN2002a_0
SPEAKER EN2002a 1 267.83 0.01 EN2002a_0
SPEAKER EN2002a 1 267.90 0.01 EN2002a_0
SPEAKER EN2002a 1 270.72 0.01 EN2002a_0
SPEAKER EN2002a 1 270.82 0.02 EN2002a_0
SPEAKER EN2002a 1 270.87 0.01 EN2002a_0
SPEAKER EN2002a 1 271.58 0.01 EN2002a_0
SPEAKER EN2002a 1 273.28 0.02 EN2002a_0
SPEAKER EN2002a 1 273.31 0.01 EN2002a_0
SPEAKER EN2002a 1 273.79 0.02 EN2002a_0
SPEAKER EN2002a 1 275.43 0.01 EN2002a_0
SPEAKER EN2002a 1 275.53 0.01 EN2002a_0
SPEAKER EN2002a 1 277.82 0.01 EN2002a_0
SPEAKER EN2002a 1 277.85 0.03 EN2002a_0
SPEAKER EN2002a 1 277.89 0.04 EN2002a_0
SPEAKER EN2002a 1 277.95 0.01 EN2002a_0
SPEAKER EN2002a 1 277.97 0.01 EN2002a_0
SPEAKER EN2002a 1 278.01 0.01 EN2002a_0
SPEAKER EN2002a 1 278.05 0.01 EN2002a_0
SPEAKER EN2002a 1 278.13 0.01 EN2002a_0
SPEAKER EN2002a 1 279.85 0.03 EN2002a_0
SPEAKER EN2002a 1 279.95 0.01 EN2002a_0
SPEAKER EN2002a 1 280.00 69.07 EN2002a_0
SPEAKER EN2002a 1 349.08 0.08 EN2002a_0
SPEAKER EN2002a 1 349.22 4.80 EN2002a_0
SPEAKER EN2002a 1 354.30 185.70 EN2002a_0
SPEAKER EN2002a 1 560.00 21.10 EN2002a_0
SPEAKER EN2002a 1 581.11 0.09 EN2002a_0
SPEAKER EN2002a 1 581.22 0.03 EN2002a_0
SPEAKER EN2002a 1 581.34 0.01 EN2002a_0
SPEAKER EN2002a 1 581.38 0.01 EN2002a_0
SPEAKER EN2002a 1 581.40 1.32 EN2002a_0
SPEAKER EN2002a 1 582.73 0.02 EN2002a_0
SPEAKER EN2002a 1 582.77 1.36 EN2002a_0
SPEAKER EN2002a 1 584.15 0.06 EN2002a_0
SPEAKER EN2002a 1 584.30 0.01 EN2002a_0
SPEAKER EN2002a 1 584.39 0.71 EN2002a_0
SPEAKER EN2002a 1 585.11 9.06 EN2002a_0
SPEAKER EN2002a 1 594.18 0.01 EN2002a_0
SPEAKER EN2002a 1 594.72 0.02 EN2002a_0
SPEAKER EN2002a 1 594.75 1.22 EN2002a_0
SPEAKER EN2002a 1 596.32 0.01 EN2002a_0
SPEAKER EN2002a 1 596.34 0.04 EN2002a_0
SPEAKER EN2002a 1 596.39 0.01 EN2002a_0
SPEAKER EN2002a 1 596.41 1.47 EN2002a_0
SPEAKER EN2002a 1 597.89 0.02 EN2002a_0
SPEAKER EN2002a 1 597.92 142.08 EN2002a_0
SPEAKER EN2002a 1 763.29 0.04 EN2002a_0
SPEAKER EN2002a 1 764.11 0.02 EN2002a_0
SPEAKER EN2002a 1 780.00 80.00 EN2002a_0
SPEAKER EN2002a 1 880.00 240.00 EN2002a_0
SPEAKER EN2002a 1 1140.00 300.00 EN2002a_0
SPEAKER EN2002a 1 1440.04 0.02 EN2002a_0
SPEAKER EN2002a 1 1440.15 0.01 EN2002a_0
SPEAKER EN2002a 1 1440.17 0.02 EN2002a_0
SPEAKER EN2002a 1 1440.22 0.02 EN2002a_0
Hi,
First of all, I want to thank you for open-sourcing this work.
Secondly, does it support inference on a single audio file?
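If the recipe expects a Kaldi-style data directory (an assumption based on the usual EEND setup; please check the repo's data preparation scripts for the exact files required), inference on one file should amount to wrapping it in a minimal data directory and passing that directory to infer.py:

import os

wav_path = "/path/to/recording.wav"   # hypothetical 8 kHz mono file
rec_id = "rec0001"
data_dir = "data/single"
os.makedirs(data_dir, exist_ok=True)

# wav.scp: <recording-id> <path-to-wav>
with open(os.path.join(data_dir, "wav.scp"), "w") as f:
    f.write(f"{rec_id} {wav_path}\n")

# utt2spk: dummy mapping; reference labels are not needed for inference,
# but the loader may still expect this file (segments/spk2utt may be needed too)
with open(os.path.join(data_dir, "utt2spk"), "w") as f:
    f.write(f"{rec_id} {rec_id}\n")

# then, following the argument order shown earlier in this thread:
# python infer.py conf/large/infer.yaml data/single pretrained_models/large/model_callhome.th outpath --model_type Transformer --gpu 0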