tf-kaldi-speaker

Overview

tf-kaldi-speaker implements a neural-network-based speaker verification system using Kaldi and TensorFlow.

The main idea is that Kaldi handles the pre- and post-processing while TensorFlow (TF) is a better choice for building the neural network. Compared with Kaldi nnet3, modifying the network in TF (e.g. adding attention or using a different loss function) takes less effort. Adding other features to support text-dependent speaker verification is also possible.

The purpose of the project is to make research on neural-network-based speaker verification easier. I also try to reproduce some of the results in my papers.

Requirements

  • Python: 2.7 (Updating to 3.6/3.7 should be easy.)

  • Kaldi: >5.5

    Since Kaldi is only used for pre- and post-processing, most versions >5.2 should work. Though I'm not 100% sure, I believe any Kaldi with x-vector support (e.g. egs/sre16/v2) is enough. If you want to run egs/voxceleb, make sure your Kaldi also contains that example.

  • Tensorflow: >1.4.0

    I wrote the code using TF 1.4.0 at the very beginning and later updated to v1.12.0. Future versions will support TF >1.12, but I will try to keep the API compatible with lower versions. Due to API changes (e.g. keep_dims was renamed to keepdims in some functions), you may run into errors about incorrect parameters. In that case, checking the parameter names against your TF version may fix the problem.
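
One way to cope with the keep_dims/keepdims rename is a small wrapper that tries the newer spelling first and falls back to the older one. This is only a sketch: `call_with_keepdims` and the two stand-in reductions are hypothetical names written in pure Python for illustration, not part of this project or of TensorFlow.

```python
def call_with_keepdims(reduce_fn, tensor, axis, keepdims=True):
    """Call a reduction with `keepdims`, falling back to the older `keep_dims`."""
    try:
        return reduce_fn(tensor, axis=axis, keepdims=keepdims)
    except TypeError:
        # Older API: the same option is spelled `keep_dims`.
        # Caveat: a TypeError raised inside reduce_fn would also land here.
        return reduce_fn(tensor, axis=axis, keep_dims=keepdims)


# Hypothetical stand-ins for an old-style and a new-style reduction.
# Both sum each row of a list of lists and ignore `axis`.
def old_reduce_sum(rows, axis=None, keep_dims=False):
    totals = [sum(row) for row in rows]
    return [[t] for t in totals] if keep_dims else totals


def new_reduce_sum(rows, axis=None, keepdims=False):
    totals = [sum(row) for row in rows]
    return [[t] for t in totals] if keepdims else totals


data = [[1, 2], [3, 4]]
old_result = call_with_keepdims(old_reduce_sum, data, axis=1)
new_result = call_with_keepdims(new_reduce_sum, data, axis=1)
```

Both calls return the same nested result, so calling code does not need to know which spelling the installed version expects.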

Methodology

The general pipeline of our framework is:

  • For training:
  1. Kaldi: Data preparation --> feature extraction --> training example generation (CMVN + VAD + ...)
  2. TF: Network training (training examples + nnet config)
  • For test:
  1. Kaldi: Data preparation --> feature extraction
  2. TF: Embedding extraction
  3. Kaldi: Backend classifier (Cosine/PLDA) --> performance evaluation
  • Evaluate the performance:
    • MATLAB is used to compute the EER, minDCF08, minDCF10, minDCF12.
    • If you do not have MATLAB, Kaldi also provides scripts to compute the EER and minDCFs. Note that the minDCF08 from Kaldi is 10x larger than the value from DETware because the two compute it differently.
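
For readers without MATLAB or Kaldi at hand, the metrics above can be sketched in plain Python. The function names here are hypothetical and the code is not the project's evaluation script. The `normalize` flag reflects an assumption about the 10x gap: with the SRE08 cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01), dividing the raw cost by min(C_miss * P_target, C_fa * (1 - P_target)) = 0.1 gives a value exactly 10 times larger. The source does not state that this is the cause.

```python
def error_rates(target_scores, nontarget_scores):
    """Return (threshold, p_miss, p_fa) for every candidate threshold.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(float("inf"))  # reject-everything operating point
    points = []
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / float(len(target_scores))
        p_fa = sum(s >= t for s in nontarget_scores) / float(len(nontarget_scores))
        points.append((t, p_miss, p_fa))
    return points


def compute_eer(points):
    """Approximate the EER at the point where p_miss and p_fa are closest."""
    _, p_miss, p_fa = min(points, key=lambda p: abs(p[1] - p[2]))
    return (p_miss + p_fa) / 2.0


def compute_min_dcf(points, p_target=0.01, c_miss=10.0, c_fa=1.0, normalize=False):
    """Minimum detection cost over all thresholds (SRE08 parameters by default)."""
    cost = min(c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
               for _, p_miss, p_fa in points)
    if normalize:
        # Assumed normalization: divide by the cost of the best trivial system.
        cost /= min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cost


targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
points = error_rates(targets, nontargets)
eer = compute_eer(points)
raw_dcf = compute_min_dcf(points)
norm_dcf = compute_min_dcf(points, normalize=True)
```

On real trial lists, sorting the scores once and sweeping the threshold is much faster than this quadratic version. The simple form is kept here for readability.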

In our framework, the speaker embedding can be trained and extracted using different network architectures. Again, the backend classifier is integrated using Kaldi.
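
Of the two backends mentioned in the pipeline, cosine scoring is simple enough to show in a few lines. In this project the backend runs inside Kaldi, so the Python below is only an illustration of the idea, with hypothetical function names and toy 2-dimensional embeddings.

```python
import math


def length_normalize(vec):
    """Scale an embedding to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_score(enroll_embedding, test_embedding):
    """Cosine similarity between an enrollment and a test embedding."""
    e = length_normalize(enroll_embedding)
    t = length_normalize(test_embedding)
    return sum(a * b for a, b in zip(e, t))


same_direction = cosine_score([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_score([1.0, 0.0], [0.0, 3.0])
```

A higher score means the two embeddings point in a more similar direction; a verification decision compares the score against a threshold chosen on development data.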

Features

  • Entire pipeline of neural network based speaker verification.
  • Both training from scratch and fine-tuning a pre-trained model are supported.
  • Examples for VoxCeleb and SRE are included. Refer to Fisher to customize the dataset.
  • Standard x-vector architecture (with minor modification).
  • Angular softmax, additive margin softmax, additive angular margin softmax, triplet loss and other loss functions.
  • Self attention and other attention methods (testing).
  • I'm now working on multitask_v1 learning.
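
To make the margin-based losses in the list above concrete, here is a minimal single-example sketch of additive margin softmax in pure Python. It is not the project's TensorFlow implementation: the function name, the toy weights, and the values margin = 0.35 and scale = 30 are illustrative assumptions. The target-class cosine is reduced by the margin before scaling, and the result goes through an ordinary softmax cross-entropy.

```python
import math


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def am_softmax_loss(embedding, class_weights, label, margin=0.35, scale=30.0):
    """Additive margin softmax loss for one example.

    embedding:     list of floats
    class_weights: one weight vector per speaker class
    label:         index of the true class
    """
    emb = _normalize(embedding)
    cosines = [sum(a * b for a, b in zip(emb, _normalize(w)))
               for w in class_weights]
    # Subtract the margin from the target-class cosine only, then scale.
    logits = [scale * (c - margin) if j == label else scale * c
              for j, c in enumerate(cosines)]
    # Numerically stable log-sum-exp for the softmax denominator.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[label]


weights = [[1.0, 0.0], [0.0, 1.0]]
loss_without_margin = am_softmax_loss([1.0, 1.0], weights, label=0, margin=0.0)
loss_with_margin = am_softmax_loss([1.0, 1.0], weights, label=0)
```

With the margin set to zero this reduces to a softmax over scaled cosines. A positive margin makes the same example cost more, which pushes embeddings of the same speaker closer together during training.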

Usage

  • The demos for SRE and VoxCeleb are included in egs/{sre,voxceleb}. Follow run.sh to go through the code.
  • The neural networks are configured using JSON files which are included in nnet_conf and the usage of the parameters is exhibited in the demos.
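
As a rough picture of JSON-driven configuration, the sketch below merges a user-supplied JSON string over a set of defaults and rejects unknown keys. Every parameter name and value here is hypothetical; the real names are defined by the files in nnet_conf and the demos.

```python
import json

# Hypothetical defaults; the real parameter names live in nnet_conf.
DEFAULT_PARAMS = {
    "network_type": "tdnn",
    "loss_func": "softmax",
    "learning_rate": 0.01,
    "batch_size": 64,
}


def load_params(json_text):
    """Parse a JSON config and overlay it on the defaults."""
    user_params = json.loads(json_text)
    unknown = sorted(set(user_params) - set(DEFAULT_PARAMS))
    if unknown:
        # Failing early catches misspelled keys before a long training run.
        raise KeyError("Unknown parameters: %s" % ", ".join(unknown))
    params = dict(DEFAULT_PARAMS)
    params.update(user_params)
    return params


example = load_params('{"loss_func": "additive_margin_softmax", "learning_rate": 0.001}')
```

Keeping the architecture and loss choice in a config file means an experiment can be changed without touching the training script.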

Performance & Speed

  • Performance

    I've tested the code on three datasets and the results are better than the standard Kaldi recipe. (Of course, you may achieve better performance with Kaldi by carefully tuning the parameters.)

    See RESULTS for details.

  • Speed

    Since the code only supports a single GPU, training is not very fast, but it is acceptable on medium-scale datasets. Using an Nvidia P100, training takes about 2.5 days for VoxCeleb and ~4 days for SRE.

    Training could be much faster with multiple GPUs. Multi-GPU support is not too difficult to add; gradient towers can achieve it.
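
The core of tower-based training is that each GPU computes gradients on its own slice of the batch and the gradients are averaged before a single update. The sketch below shows only that averaging step in framework-free Python, with gradients as flat lists of floats. `average_gradients` and the toy values are hypothetical, and this is not code from the project.

```python
def average_gradients(tower_grads):
    """Average per-variable gradients across towers.

    tower_grads: one list per tower of (gradient, variable_name) pairs,
                 with variables in the same order on every tower.
    Returns one list of (averaged_gradient, variable_name) pairs.
    """
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        name = grads_and_vars[0][1]
        grads = [g for g, _ in grads_and_vars]
        # Element-wise mean over the towers.
        mean = [sum(vals) / float(len(vals)) for vals in zip(*grads)]
        averaged.append((mean, name))
    return averaged


tower_0 = [([1.0, 2.0], "w"), ([0.5], "b")]
tower_1 = [([3.0, 4.0], "w"), ([1.5], "b")]
merged = average_gradients([tower_0, tower_1])
```

In a real multi-GPU setup the per-tower gradients would be tensors placed on different devices, and the averaged result would be passed to the optimizer once per step.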

Pretrained models

  • VoxCeleb

    Training data: VoxCeleb1 dev set and VoxCeleb2

    Google Drive and BaiduYunDisk (extraction code: xwu6)

  • NIST SRE

    Training data: NIST SRE04-08, SWBD

    The SRE models are available upon request.

Pros and cons

  • Advantages

    1. Performance: Our code has been shown to perform better than Kaldi.
    2. Storage: There is no need to generate packed egs as in Kaldi when training the network. The training loads the data on the fly.
    3. Flexibility: Changing the network architecture and loss function is pretty easy.
  • Disadvantages

    1. Training speed: Only a single GPU is supported at the moment. Multi-GPU training can be achieved using data parallelism and tower-based training.
    2. Data loading: Since no packed egs are generated, multiple CPUs must be used to load the data during training. This is an overhead when the utterances are very long. You have to assign enough CPUs to make the loading speed fast enough to match the GPU processing speed.
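
The on-the-fly loading trade-off above can be sketched with a small worker pool that prepares segments while the consumer assembles batches. Everything here is an assumption for illustration: `load_segment` fakes feature reading with random numbers, the 200-frame chunk length is arbitrary, and threads stand in for the CPU workers a real loader would use.

```python
import random
from multiprocessing.pool import ThreadPool

CHUNK_FRAMES = 200  # hypothetical fixed segment length


def load_segment(utt_id):
    """Hypothetical loader: fake an utterance and crop a random chunk from it."""
    rng = random.Random(utt_id)
    num_frames = rng.randint(CHUNK_FRAMES, 2 * CHUNK_FRAMES)
    frames = [[rng.random() for _ in range(3)] for _ in range(num_frames)]
    start = rng.randint(0, num_frames - CHUNK_FRAMES)
    return utt_id, frames[start:start + CHUNK_FRAMES]


def batches(utt_ids, batch_size, num_workers):
    """Yield batches of (utt_id, segment) pairs loaded by a pool of workers."""
    pool = ThreadPool(num_workers)
    try:
        batch = []
        # imap keeps the input order while workers load in parallel.
        for item in pool.imap(load_segment, utt_ids):
            batch.append(item)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
    finally:
        pool.close()
        pool.join()


utt_ids = ["utt%d" % i for i in range(10)]
loaded = list(batches(utt_ids, batch_size=4, num_workers=3))
```

The longer the utterances, the more work each `load_segment` call does before cropping, which is why more workers are needed to keep the GPU busy.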

Other discussions

  • In this code, I provide two possible methods to tune the learning rate when SGD is used: using a validation set and using a fixed file. The first method works well but it may take longer to train the network.

  • David Snyder and Dan Povey recently released a new paper on diarization performance using x-vectors. The network in that paper has more than 10 layers. You may want to change model/tdnn.py to implement the new network. ResNet is also used in many works. I haven't done anything to find the best network architecture. A deeper network is worth trying since we have enough training data.
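
The two learning-rate strategies from the first point above can be sketched as follows. This is not the project's code. It assumes that "fixed file" means a predefined schedule of (start_epoch, learning_rate) pairs, and the function names, patience, decay factor and floor are all hypothetical.

```python
def next_learning_rate(lr, valid_losses, patience=2, factor=0.5, min_lr=1e-5):
    """Validation-based: decay lr when the loss has stopped improving.

    valid_losses: validation losses so far, oldest first.
    The rate is multiplied by `factor` if none of the last `patience`
    evaluations beat the best loss seen before them.
    """
    if len(valid_losses) <= patience:
        return lr
    best_before = min(valid_losses[:-patience])
    recent_best = min(valid_losses[-patience:])
    if recent_best >= best_before:
        return max(lr * factor, min_lr)
    return lr


def learning_rate_from_schedule(schedule, epoch):
    """Fixed schedule: pick the rate of the last entry whose start_epoch <= epoch.

    schedule: list of (start_epoch, lr) pairs sorted by start_epoch.
    """
    lr = schedule[0][1]
    for start_epoch, value in schedule:
        if epoch >= start_epoch:
            lr = value
    return lr


decayed = next_learning_rate(0.1, [1.0, 0.9, 0.95, 0.92])
unchanged = next_learning_rate(0.1, [1.0, 0.9, 0.8, 0.7])
schedule = [(0, 0.01), (5, 0.001)]
```

The validation-based rule needs an extra evaluation pass before each decision, which is consistent with the note above that it may take longer to train.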

License

Apache License, Version 2.0 (Refer to LICENCE)

Acknowledgements

The computational resources are initially provided by Prof. Gales, and then mainly supported by Dr. Liang He.

Last ...

  • Unfortunately, the code was developed under Windows, so file permissions cannot be maintained properly. After downloading the code, simply run:

    find ./ -name "*.sh" -exec chmod +x {} +
    

    to add the executable ('x') permission to the .sh files.

  • For cluster setup, please refer to the Kaldi documentation. In my case, the program runs locally. Modify cmd.sh and path.sh according to the standard Kaldi setup.

  • If you encounter any problems, please open an issue.

  • If you have any extensions, feel free to create a PR.

  • Details (configurations, adding network components, etc.) will be updated later.

  • Contact:

    E-mail: liry17 (at) mails (dot) tsinghua (dot) edu (dot) cn

Related papers

I'm working on a paper describing the performance of this library. Will update soon.
