Hello Hervé,
I am currently trying to train the speaker embedding module using the TristouNet architecture, but the loss becomes NaN from the second epoch onwards. Here is the command I am running:
$ pyannote-speaker-embedding-keras train --database=db.yml --subset=train tutorials/speaker-embedding/2+0.5/TristouNet Etape.SpeakerDiarization.TV
And here are the warnings/log messages:
/home/mahu/anaconda3/envs/pyannote/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
/home/mahu/anaconda3/envs/pyannote/lib/python3.5/site-packages/pyannote/generators/indices.py:84: UserWarning: 5 labels (out of 179) have less than 3 training samples.
per_label=per_label))
Epoch 1/1000
2018-01-24 17:20:21.787683: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
/home/mahu/anaconda3/envs/pyannote/lib/python3.5/site-packages/autograd/core.py:81: RuntimeWarning: divide by zero encountered in power
result_value = self.fun(*argvals, **kwargs)
/home/mahu/anaconda3/envs/pyannote/lib/python3.5/site-packages/autograd/numpy/numpy_grads.py:84: RuntimeWarning: invalid value encountered in multiply
anp.sqrt.defvjp( lambda g, ans, vs, gvs, x : g * 0.5 * x**-0.5)
/home/mahu/anaconda3/envs/pyannote/lib/python3.5/site-packages/autograd/numpy/numpy_grads.py:46: RuntimeWarning: invalid value encountered in multiply
unbroadcast(vs, gvs, g * y * x ** anp.where(y, y - 1, 1.)))
/home/mahu/anaconda3/envs/pyannote/lib/python3.5/site-packages/numpy/core/_methods.py:29: RuntimeWarning: invalid value encountered in reduce
return umr_minimum(a, axis, None, out, keepdims)
/home/mahu/anaconda3/envs/pyannote/lib/python3.5/site-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
return umr_maximum(a, axis, None, out, keepdims)
1/1 [==============================] - 36s - loss: 0.0535
Epoch 2/1000
1/1 [==============================] - 30s - loss: nan
Epoch 3/1000
1/1 [==============================] - 31s - loss: nan
Some minor details: as you may have guessed, I have slightly changed the options of the train method of pyannote-speaker-embedding-keras so that, like the data command, a path other than ~/.pyannote/db.yml can be specified for the db.yml file.
The various config.yml files (tutorials/speaker-embedding/config.yml and tutorials/speaker-embedding/2+0.5/TristouNet/config.yml) have the same content as what is given in the corresponding tutorial.
That said, another odd thing is that two progress indicators are printed for each subset when running
$ pyannote-speaker-embedding-keras data --database=db.yml --duration=2 --step=0.5 tutorials/speaker-embedding/ Etape.SpeakerDiarization.TV
as shown below:
Training set: 0it [00:00, ?it/s]
Training set: 28it [02:57, 6.32s/it]
100%|████████████████████████████████████| 81433/81433 [00:37<00:00, 2148.18it/s]
Development set: 0it [00:00, ?it/s]
Development set: 9it [00:47, 5.32s/it]
100%|████████████████████████████████████| 23298/23298 [00:11<00:00, 2082.87it/s]
Test set: 0it [00:00, ?it/s]
Test set: 9it [00:50, 5.66s/it]
100%|████████████████████████████████████| 22815/22815 [00:10<00:00, 2132.08it/s]
So I don't really know whether the problem I am facing comes from the training phase or from the data used to train the network. Looking around a bit, the warnings given by autograd may be related to the bug reported here.
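For what it's worth, the divide-by-zero that autograd reports in the sqrt gradient (g * 0.5 * x**-0.5) would produce exactly this NaN pattern whenever a squared distance between two embeddings is exactly zero. Here is a minimal numpy sketch of that suspicion (plain arithmetic, not pyannote's actual code):

```python
import numpy as np

# Sketch of the suspected failure mode: the gradient of sqrt(x) is
# 0.5 * x**-0.5, which diverges when the squared distance x between two
# identical embeddings is exactly 0. The resulting inf then becomes NaN
# as soon as it is multiplied by a zero inner gradient.
with np.errstate(divide="ignore", invalid="ignore"):
    x = np.float64(0.0)        # squared distance between identical samples
    dsqrt = 0.5 * x ** -0.5    # inf  -> "divide by zero encountered in power"
    inner = np.float64(0.0)    # gradient of (a - b)**2 at a == b
    g = dsqrt * inner          # nan  -> "invalid value encountered in multiply"

print(dsqrt, g)  # inf nan
```

That would match both RuntimeWarnings above (divide by zero in power, invalid value in multiply), but I have not confirmed that a zero distance actually occurs in my batches.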
Have you encountered this problem before? And if so, how did you manage to circumvent it?
Cheers,
Mathieu
Edit:
After some printing, it appears that logs = self.loss_and_grad(batch, embedding) in pyannote/audio/embedding/approaches_keras/base.py, l. 333, yields a gradient with NaN values on the first epoch. However, I haven't been able to find the definition of loss_and_grad to narrow down the problem yet.
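In case it helps anyone debugging the same thing, this is the kind of check I used to spot the NaN gradient; the helper below is purely illustrative and not part of pyannote-audio's API:

```python
import numpy as np

# Purely illustrative helper (not pyannote-audio code): scan a sequence of
# gradient arrays and return the index of the first one containing NaN,
# or None if all gradients are finite in that respect.
def first_nan(arrays):
    for i, a in enumerate(arrays):
        if np.isnan(np.asarray(a, dtype=float)).any():
            return i
    return None

# e.g. inspect whatever loss_and_grad returned, gradient by gradient
grads = [np.array([0.1, 0.2]), np.array([np.nan, 1.0])]
print(first_nan(grads))  # 1
```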