tinydiarize

  • Speaker diarization labels who said what in a transcript (e.g. Speaker A, Speaker B, …). It is essential for conversation transcripts like meetings or podcasts.
  • tinydiarize aims to be a minimal, interpretable extension of OpenAI's Whisper models that adds speaker diarization with few extra dependencies (inspired by minGPT).
  • This uses a finetuned model that adds special tokens to mark speaker changes [1,2,3,4]. It can use both voice and semantic context to tell speakers apart, which is a unique benefit of this approach.
  • It needs only a tiny change to the inference code (<50 lines) and runs with minimal extra cost. This makes it easy to add to ports like whisper.cpp that run on consumer hardware like MacBooks and iPhones.
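
To make the idea of speaker-change tokens concrete, here is a minimal sketch of how decoded text containing such a marker could be split into speaker turns. The marker string `[SPEAKER_TURN]` and the helper `split_turns` are hypothetical names chosen for illustration; the special token actually emitted by the finetuned model may look different.

```python
# Hypothetical marker; the real special token emitted by the model may differ.
SPEAKER_TURN = "[SPEAKER_TURN]"


def split_turns(text: str, marker: str = SPEAKER_TURN) -> list[str]:
    """Split decoded text into speaker turns at each marker occurrence."""
    turns = [chunk.strip() for chunk in text.split(marker)]
    # Drop empty chunks, e.g. when the text starts or ends with a marker.
    return [turn for turn in turns if turn]


if __name__ == "__main__":
    # Made-up transcript, for illustration only.
    decoded = (
        "Thanks for joining the call. Any questions? [SPEAKER_TURN] "
        "Yes, how did margins change? [SPEAKER_TURN] They improved slightly."
    )
    for i, turn in enumerate(split_turns(decoded), start=1):
        print(f"Turn {i}: {turn}")
```

This only recovers turn boundaries. It does not say which turns belong to the same speaker.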

Demo

[Demo video: demo_video-trim.mp4]

You can try it out on other such gems from YouTube using this notebook.

Quickstart

Install ffmpeg following the original repo, then run:

pip install -e .
whisper --model small.en-tdrz AUDIO 

The only change is using the small.en-tdrz model instead of small.en. That's it! 🎉

What's included?

  • Finetuned checkpoint for the small.en-tdrz model (located here) and example inference code (relevant edits in [#4] [#11]). This has the same dependencies as the original whisper repo.
  • Tools for comparison and analysis (under /tdrz_dev):
    • A scoring tool to measure and compare accuracy on your own data in an easy to interpret way.
    • A reference script to run and compare various diarization pipelines.
    • A Jupyter notebook to compare and understand performance in detail.
  • Finetuning code will also be made available shortly.

We aim to provide a starting point enabling anyone (or even OpenAI themselves!) to improve performance and extend support (multilingual, speech translation etc.).

Performance

metric               small.en   small.en-tdrz
spk_turn_precision   -          97.7
spk_turn_recall      -          70.8
wer_overall          11.0       10.3
wer_speaker_switch   15.0       15.5

On a (tiny) benchmark set of 3 earnings calls, tdrz gets near-perfect speaker turn precision (97.7) at fairly decent recall (70.8), while WER stays close to that of the original model (10.3 vs. 11.0 overall). Not too shabby for a tiny finetuning setup, and <10% extra inference cost!
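
As a rough illustration of what speaker turn precision and recall mean, the sketch below scores predicted turn positions against reference positions. Representing turns as sets of word indices and the helper `turn_precision_recall` are assumptions made for this example; the scoring tool under tdrz_dev may align and count turns differently. The numbers are made up.

```python
def turn_precision_recall(
    reference: set[int], predicted: set[int]
) -> tuple[float, float]:
    """Precision and recall of predicted speaker-turn positions.

    Positions are word indices at which a speaker change occurs
    (an assumed representation, not necessarily the scoring tool's).
    """
    true_positives = len(reference & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    reference = {12, 40, 57, 90}  # made-up ground-truth turn positions
    predicted = {12, 57, 90}      # made-up model output with one turn missed
    precision, recall = turn_precision_recall(reference, predicted)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

High precision with lower recall, as in the table above, corresponds to this situation: the turns the model marks are almost always real, but some real turns go unmarked.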

Refer to tdrz_dev for details on performance analysis and comparisons.

More info

  • Whisper small.en checkpoints were finetuned on ~100hrs of AMI meetings using HuggingFace Transformers and Datasets.
  • With some tricks, this can be done relatively cheaply: about 30 minutes of training on a single GPU is enough to start producing decent results. Tiny indeed 😊.
  • We used helpful tools from pyannote (the OG open-source diarization toolkit) for finetuning data preparation, and also analyzed its performance.
  • We make use of the excellent open-source revdotcom/fstalign tool for scoring and analysis.
  • Stay tuned for details in an upcoming blog post! 📺

Gotchas

Note that this is still an early proof-of-concept and there are a few things to be aware of:

  • Only the small.en English model has been finetuned.
  • Word-error-rate (WER) is close to that of the original models, although this has not yet been extensively tested. Ad-hoc inspection does show some differences in timestamp behavior (longer segments) and some deletion errors. See the notebook under tdrz_dev for details.
  • Given a pretty tiny finetuning setup, there's likely a lot of room for further accuracy improvements.
  • Only local diarization (segmentation into speaker turns) is handled so far. Extension with global diarization (speaker clustering) is planned for later.
  • Stuff is still hacky and subject to change, so hold your horses just yet! 🐎
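
For readers who want to sanity-check WER on their own transcripts, the following is a minimal sketch of the standard word-level edit-distance definition. It is not the fstalign scoring used in this project, it applies no text normalization, and the function name `word_error_rate` is chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sentences: one reference word is deleted in the hypothesis.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.3f}")
```

Deletion errors like the one in this toy example are the kind noted in the list above.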

Roadmap

  • inference code & demo
  • scoring and analysis tools
  • whisper.cpp integration
  • reproducible dataprep + finetuning*
  • blog post explainer*
  • HuggingFace integration
  • better LoRA-based small.en checkpoint
  • possibly clustering with NME-SC?
  • possibly large-v2 checkpoint?

* marks what I am working on at the moment. Contributions will be easier to make after this, and are most welcome!

References

[1] Joint Speech Recognition and Speaker Diarization via Sequence Transduction
[2] Serialized Output Training for End-to-End Overlapped Speech Recognition
[3] Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection
[4] Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

For information on the underlying Whisper model, please refer to the original documentation (release: 20230308)

License

Code and model weights are released under the MIT License. See LICENSE for further details.
