whisperkittools

Python tools for WhisperKit

  • Convert PyTorch Whisper models to WhisperKit format
  • Apply custom inference optimizations and model compression
  • Evaluate Whisper using WhisperKit and other Whisper implementations on benchmarks

Installation

  • Step 2: Create and activate a Python virtual environment, e.g.:
conda create -n whisperkit python=3.11 -y && conda activate whisperkit
  • Step 3: Install as editable
cd WHISPERKIT_ROOT_DIR && pip install -e .

Model Generation

Convert Hugging Face Whisper Models (PyTorch) to WhisperKit (Core ML) format:

whisperkit-generate-model --model-version <model-version> --output-dir <output-dir>

For optional arguments related to model optimizations, please see the help menu with -h.

Publishing Models

We host several popular Whisper model versions here. These hosted models are automatically over-the-air deployable to apps integrating WhisperKit such as our example app WhisperAX on TestFlight. If you would like to publish custom Whisper versions that are not already published, you can do so as follows:

  • Step 1: Find the user or organization name that you have write access to on Hugging Face Hub. If you are logged into huggingface-cli locally, you may simply do:
huggingface-cli whoami

If you don't have a write token yet, you can generate it here.

  • Step 2: Point to the model repository that you would like to publish to, e.g. my-org/my-whisper-repo-name, with the MODEL_REPO_ID environment variable and specify the name of the source PyTorch Whisper repository (e.g. distil-whisper/distil-small.en)
MODEL_REPO_ID=my-org/my-whisper-repo-name whisperkit-generate-model --model-version distil-whisper/distil-small.en --output-dir <output-dir>

If the above command is successfully executed, your model will have been published to hf.co/my-org/my-whisper-repo-name/distil-whisper_distil-small.en!
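The published location in the example above can be sketched as a small helper. This is a minimal sketch: the function name published_model_path is hypothetical, and the rule that "/" in the source model version becomes "_" in the published folder name is only inferred from the single example shown here, not confirmed as the tool's general behavior.

```python
def published_model_path(model_repo_id: str, model_version: str) -> str:
    """Guess where a model lands on the Hub after publishing.

    Assumption (inferred from one example in this README): the "/" in the
    source model version is replaced by "_" to form the folder name.
    """
    folder = model_version.replace("/", "_")
    return f"hf.co/{model_repo_id}/{folder}"


print(published_model_path("my-org/my-whisper-repo-name",
                           "distil-whisper/distil-small.en"))
```

For the example values, this prints the same path as the one quoted above.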

Model Evaluation

Evaluate (Argmax- or developer-published) models on speech recognition datasets:

whisperkit-evaluate-model --model-version <model-version> --output-dir <output-dir> --dataset {librispeech-debug,librispeech,earnings22}

By default, this command uses the latest main branch commits from WhisperKit and searches within Argmax-published model repositories. For optional arguments related to code and model versioning, please see the help menu with -h.

We continually publish the evaluation results of Argmax-hosted models here as part of our continuous integration tests.

Model Evaluation on Custom Dataset

If you would like to evaluate WhisperKit models on your own dataset:

  • Step 1: Publish a dataset on the Hub with the same simple structure as this toy dataset (audio files + metadata.json)
  • Step 2: Run evaluation with environment variables as follows:
export CUSTOM_EVAL_DATASET="my-dataset-name-on-hub"
export DATASET_REPO_OWNER="my-user-or-org-name-on-hub"
export MODEL_REPO_ID="my-org/my-whisper-repo-name" # if evaluating self-published models
whisperkit-evaluate-model --model-version <model-version> --output-dir <output-dir> --dataset my-dataset-name-on-hub
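If you prefer to drive the evaluation from a Python script, the same environment variables and command can be assembled as in the sketch below. The helper name build_custom_eval and the example values are hypothetical. Only the flags and environment variables shown above are used, and the command is printed rather than executed.

```python
import os
import shlex


def build_custom_eval(model_version, output_dir, dataset_name,
                      dataset_owner, model_repo_id=None):
    """Return (argv, env) for a custom-dataset evaluation run."""
    env = dict(os.environ)  # copy so the current process is left untouched
    env["CUSTOM_EVAL_DATASET"] = dataset_name
    env["DATASET_REPO_OWNER"] = dataset_owner
    if model_repo_id is not None:
        # Only needed when evaluating self-published models
        env["MODEL_REPO_ID"] = model_repo_id
    argv = [
        "whisperkit-evaluate-model",
        "--model-version", model_version,
        "--output-dir", output_dir,
        "--dataset", dataset_name,
    ]
    return argv, env


argv, env = build_custom_eval(
    model_version="distil-whisper/distil-small.en",  # example value
    output_dir="eval_out",                           # example value
    dataset_name="my-dataset-name-on-hub",
    dataset_owner="my-user-or-org-name-on-hub",
    model_repo_id="my-org/my-whisper-repo-name",
)
print(shlex.join(argv))
# To launch for real: subprocess.run(argv, env=env, check=True)
```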

Python Inference

Use the unified Python wrapper for several Whisper frameworks:

from whisperkit.pipelines import WhisperKit, WhisperCpp, WhisperMLX

pipe = WhisperKit(whisper_version="openai/whisper-large-v3", out_dir="/path/to/out/dir")
print(pipe("audio.{wav,flac,mp3}"))

Note: Using WhisperCpp requires ffmpeg to be installed. The recommended way to install it is brew install ffmpeg.

Example SwiftUI App

TestFlight

Source Code (MIT License)

This app serves two purposes:

  • Base template for developers to freely customize and integrate parts into their own app
  • Real-world testing/debugging utility for custom Whisper versions or WhisperKit features before/without building an app.

Note that the app is in beta and we are actively seeking feedback to improve it before widely distributing it.

WhisperKit Evaluation Results

Dataset: librispeech

Short-form Audio (<30s/clip) - 5 hours of English audiobook clips

Model                        WER (↓)  QoI (↑)  File Size (MB)  Code Commit
large-v2 (WhisperOpenAIAPI)  2.35     100      3100            N/A
large-v2                     2.77     96.6     3100            Link
large-v2_949MB               2.4      94.6     949             Link
large-v2_turbo               2.76     96.6     3100            Link
large-v2_turbo_955MB         2.41     94.6     955             Link
large-v3                     2.04     95.2     3100            Link
large-v3_947MB               2.46     93.9     947             Link
large-v3_turbo               2.03     95.4     3100            Link
large-v3_turbo_954MB         2.47     93.9     954             Link
distil-large-v3              2.47     89.7     1510            Link
distil-large-v3_594MB        2.96     85.4     594             Link
distil-large-v3_turbo        2.47     89.7     1510            Link
distil-large-v3_turbo_600MB  2.78     86.2     600             Link
small.en                     3.12     85.8     483             Link
small                        3.45     83       483             Link
base.en                      3.98     75.3     145             Link
base                         4.97     67.2     145             Link
tiny.en                      5.61     63.9     66              Link
tiny                         7.47     52.5     66              Link

Dataset: earnings22

Long-Form Audio (>1hr/clip) - 120 hours of earnings call recordings in English with various accents

Model                        WER (↓)  QoI (↑)  File Size (MB)  Code Commit
large-v2 (WhisperOpenAIAPI)  16.27    100      3100            N/A
large-v3                     15.17    58.5     3100            Link
base.en                      23.49    6.5      145             Link
tiny.en                      28.64    5.7      66              Link

Explanation

We believe that rigorously measuring the quality of inference is necessary for developers and enterprises to make informed decisions when opting to use optimized or compressed variants of any machine learning model in production. To contextualize WhisperKit, we take the following Whisper implementations and benchmark them using a consistent evaluation harness:

Server-side:

  • WhisperOpenAIAPI: OpenAI's Whisper API ($0.36 per hour of audio as of 02/29/24, 25MB file size limit per request)

On-device:

  • WhisperKit
  • WhisperCpp
  • WhisperMLX

(All on-device implementations are available for free under MIT license as of 03/19/2024)

WhisperOpenAIAPI sets the reference and we assume that it is using the equivalent of openai/whisper-large-v2 in float16 precision along with additional undisclosed optimizations from OpenAI. In all measurements, we care primarily about per-example no-regressions (quantified as qoi below), which is a stricter metric than the dataset-average Word Error Rate (WER). A 100% qoi preserves perfect backwards-compatibility on the test distribution and avoids "perceived regressions", the phenomenon where per-example known behavior changes after a code/model update and causes divergence in downstream code or breaks the user experience itself (even if dataset averages might stay flat across updates). Pseudocode for qoi:

qoi = []
for example in dataset:
    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
    qoi.append(no_regression)
qoi = (sum(qoi) / len(qoi)) * 100.

Note that the ordering of models with respect to WER does not necessarily match the ordering with respect to QoI. This is because the reference model gets assigned a QoI of 100% by definition. Any per-example regression by other implementations gets penalized while per-example improvements are not rewarded. QoI (higher is better) matters where the production behavior is established by the reference results and the goal is to not regress when switching to an optimized or compressed model. On the other hand, WER (lower is better) matters when there is no established production behavior and one is picking the best quality versus model size trade-off point.
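The qoi pseudocode above can be turned into a small runnable sketch. This is an illustration only: the function names are hypothetical, WER is computed with a plain word-level edit distance without any text normalization, and the transcripts are made-up toy strings. The evaluation harness in whisperkittools may compute WER differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def quality_of_inference(ground_truth, reference_outputs, optimized_outputs):
    """Percentage of examples where the optimized model does not regress."""
    no_regression = [
        word_error_rate(truth, opt) <= word_error_rate(truth, ref)
        for truth, ref, opt in zip(ground_truth, reference_outputs, optimized_outputs)
    ]
    return 100.0 * sum(no_regression) / len(no_regression)


# Toy data: the optimized model fixes example 2 but regresses on example 1.
truth = ["the cat sat", "hello world"]
reference = ["the cat sat", "hello word"]
optimized = ["the cat sit", "hello world"]

mean_wer_ref = sum(map(word_error_rate, truth, reference)) / len(truth)
mean_wer_opt = sum(map(word_error_rate, truth, optimized)) / len(truth)
print(round(mean_wer_ref, 3), round(mean_wer_opt, 3))
print(quality_of_inference(truth, reference, optimized))
```

In this toy case the optimized model has the lower mean per-example WER, yet its QoI is only 50% because one of the two examples regressed. This is the mismatch between WER ordering and QoI ordering described above.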

We anticipate that developers who use Whisper (or similar models) in production have their own Quality Assurance test sets, and whisperkittools offers the tooling necessary to run the same measurements on them. Please see the Model Evaluation on Custom Dataset section for details.

Why are there so many Whisper versions?

WhisperKit is an SDK for building speech-to-text features in apps across a wide range of Apple devices. We are working towards abstracting away the model versioning from the developer so WhisperKit "just works" by deploying the highest-quality model version that a particular device can execute. In the interim, we leave the choice to the developer by providing quality and size trade-offs.
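Until that abstraction exists, the choice can be made by hand from the published numbers. The sketch below picks the lowest-WER variant that fits a file size budget, using rows copied from the librispeech results on this page. Treating file size as a stand-in for what a device can execute is an assumption made for illustration; the helper name best_model_under is hypothetical and this is not WhisperKit's selection logic.

```python
from typing import NamedTuple, Optional


class ModelResult(NamedTuple):
    name: str
    wer: float
    qoi: float
    size_mb: int


# WhisperKit rows from the librispeech results on this page.
LIBRISPEECH = [
    ModelResult("large-v2", 2.77, 96.6, 3100),
    ModelResult("large-v2_949MB", 2.4, 94.6, 949),
    ModelResult("large-v2_turbo", 2.76, 96.6, 3100),
    ModelResult("large-v2_turbo_955MB", 2.41, 94.6, 955),
    ModelResult("large-v3", 2.04, 95.2, 3100),
    ModelResult("large-v3_947MB", 2.46, 93.9, 947),
    ModelResult("large-v3_turbo", 2.03, 95.4, 3100),
    ModelResult("large-v3_turbo_954MB", 2.47, 93.9, 954),
    ModelResult("distil-large-v3", 2.47, 89.7, 1510),
    ModelResult("distil-large-v3_594MB", 2.96, 85.4, 594),
    ModelResult("distil-large-v3_turbo", 2.47, 89.7, 1510),
    ModelResult("distil-large-v3_turbo_600MB", 2.78, 86.2, 600),
    ModelResult("small.en", 3.12, 85.8, 483),
    ModelResult("small", 3.45, 83, 483),
    ModelResult("base.en", 3.98, 75.3, 145),
    ModelResult("base", 4.97, 67.2, 145),
    ModelResult("tiny.en", 5.61, 63.9, 66),
    ModelResult("tiny", 7.47, 52.5, 66),
]


def best_model_under(budget_mb: int, results=LIBRISPEECH) -> Optional[ModelResult]:
    """Lowest-WER model whose file size fits the budget, or None."""
    candidates = [r for r in results if r.size_mb <= budget_mb]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.wer)


for budget in (50, 100, 500, 1000, 4000):
    choice = best_model_under(budget)
    print(budget, choice.name if choice else None)
```

Swapping the key to `lambda r: -r.qoi` would instead favor the variant that regresses least against the reference, which is the better criterion when production behavior is already established.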

Datasets

  • librispeech: ~5 hours of short English audio clips, tests short-form transcription quality
  • earnings22: ~120 hours of English audio clips from earnings calls with various accents, tests long-form transcription quality

Reproducing Results

Benchmark results on this page were automatically generated by whisperkittools using our cluster of Apple Silicon Macs as self-hosted runners on GitHub Actions. We periodically recompute these benchmarks as part of our CI pipeline. Due to security concerns, we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to run identical evaluation jobs locally. For reference, our M2 Ultra devices complete a librispeech + openai/whisper-large-v3 evaluation in under 1 hour regardless of the Whisper implementation. The oldest Apple Silicon Macs should take less than 1 day to complete the same evaluation.

Glossary

  • _turbo: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription as described in our Blog Post.

  • _*MB: Indicates the presence of model compression. Instead of cluttering the filename with details like _AudioEncoder-5.8bits_TextDecoder-6.1bits_QLoRA-rank=16, we choose to summarize the compression spec as the resulting total file size since this is what matters to developers in production.
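The two suffixes above can be read off a model name mechanically. The following sketch assumes that names follow the pattern seen in the results tables (an optional _turbo, then an optional _<size>MB). The parser and its names are hypothetical and not part of whisperkittools.

```python
import re
from typing import NamedTuple, Optional


class ModelVariant(NamedTuple):
    base: str               # e.g. "large-v3"
    turbo: bool             # True if the "_turbo" suffix is present
    size_mb: Optional[int]  # compressed size if "_<N>MB" is present, else None


# Lazy base match so the optional suffixes are captured when present.
_VARIANT = re.compile(r"^(?P<base>.+?)(?P<turbo>_turbo)?(?:_(?P<size>\d+)MB)?$")


def parse_variant(name: str) -> ModelVariant:
    match = _VARIANT.match(name)
    if match is None:
        raise ValueError(f"not a recognizable model name: {name!r}")
    size = match.group("size")
    return ModelVariant(
        base=match.group("base"),
        turbo=match.group("turbo") is not None,
        size_mb=int(size) if size else None,
    )


for name in ("large-v3", "large-v3_turbo", "large-v3_947MB", "large-v3_turbo_954MB"):
    print(name, parse_variant(name))
```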

FAQ

Q1: xcrun: error: unable to find utility "coremlcompiler", not a developer tool or in PATH

A1: Ensure Xcode is installed on your Mac and run sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer.

Citation

If you use WhisperKit for something cool or just find it useful, please drop us a note at [email protected]!

If you use WhisperKit for academic work, here is the BibTeX:

@misc{whisperkit-argmax,
title = {WhisperKit},
author = {Argmax, Inc.},
year = {2024},
URL = {https://github.com/argmaxinc/WhisperKit}
}
