whisper.cpp

Stable: v1.5.5 / Roadmap | F.A.Q.

High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model:

Supported platforms:

The entire high-level implementation of the model is contained in whisper.h and whisper.cpp. The rest of the code is part of the ggml machine learning library.

Having such a lightweight implementation of the model makes it easy to integrate into different platforms and applications. As an example, here is a video of running the model on an iPhone 13 device - fully offline, on-device: whisper.objc

whisper-iphone-13-mini-2.mp4

You can also easily make your own offline voice assistant application: command

command-0.mp4

On Apple Silicon, the inference runs fully on the GPU via Metal:

metal-base-1.mp4

Or you can even run it straight in the browser: talk.wasm

Implementation details

  • The core tensor operations are implemented in C (ggml.h / ggml.c)
  • The transformer model and the high-level C-style API are implemented in C++ (whisper.h / whisper.cpp)
  • Sample usage is demonstrated in main.cpp (a minimal API sketch is also shown right after this list)
  • Sample real-time audio transcription from the microphone is demonstrated in stream.cpp
  • Various other examples are available in the examples folder
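
To give an idea of how small the C-style API surface is, here is a minimal transcription sketch. It is not the full main.cpp: it assumes pcm already holds 16 kHz mono float samples and that the function names match the whisper.h shipped with this release.

#include "whisper.h"

#include <cstdio>
#include <vector>

int transcribe(const char * model_path, const std::vector<float> & pcm) {
    // load the model once; the context can be reused for many calls
    struct whisper_context * ctx =
        whisper_init_from_file_with_params(model_path, whisper_context_default_params());
    if (!ctx) return 1;

    // default greedy decoding parameters
    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.language = "en";

    // run the full encoder + decoder pipeline on the audio
    if (whisper_full(ctx, params, pcm.data(), (int) pcm.size()) != 0) {
        whisper_free(ctx);
        return 1;
    }

    // print the resulting text segments
    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        printf("%s\n", whisper_full_get_segment_text(ctx, i));
    }

    whisper_free(ctx);
    return 0;
}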

The tensor operators are optimized heavily for Apple silicon CPUs. Depending on the computation size, Arm Neon SIMD intrinsics or CBLAS Accelerate framework routines are used. The latter are especially effective for bigger sizes since the Accelerate framework utilizes the special-purpose AMX coprocessor available in modern Apple products.
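
As an illustration of the kind of kernel this refers to, here is a simplified NEON dot product with fused multiply-add. This is a sketch only, not the actual ggml code, which is more elaborate (unrolling, f16 variants, Accelerate paths, etc.).

#include <arm_neon.h>

// simplified sketch of an f32 dot product using NEON fused multiply-add
float dot_f32_neon(const float * x, const float * y, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(x + i), vld1q_f32(y + i)); // acc += x*y, 4 lanes at once
    }
    float sum = vaddvq_f32(acc); // horizontal sum of the 4 lanes
    for (; i < n; ++i) {
        sum += x[i]*y[i]; // scalar tail
    }
    return sum;
}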

Quick start

First clone the repository:

git clone https://github.com/ggerganov/whisper.cpp.git

Then, download one of the Whisper models converted to ggml format. For example:

bash ./models/download-ggml-model.sh base.en

Now build the main example and transcribe an audio file like this:

# build the main example
make

# transcribe an audio file
./main -f samples/jfk.wav

For a quick demo, simply run make base.en:

$ make base.en

cc  -I.              -O3 -std=c11   -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp -o whisper.o
c++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main  -framework Accelerate
./main -h

usage: ./main [options] file0.wav file1.wav ...

options:
  -h,        --help              [default] show this help message and exit
  -t N,      --threads N         [4      ] number of threads to use during computation
  -p N,      --processors N      [1      ] number of processors to use during computation
  -ot N,     --offset-t N        [0      ] time offset in milliseconds
  -on N,     --offset-n N        [0      ] segment index offset
  -d  N,     --duration N        [0      ] duration of audio to process in milliseconds
  -mc N,     --max-context N     [-1     ] maximum number of text context tokens to store
  -ml N,     --max-len N         [0      ] maximum segment length in characters
  -sow,      --split-on-word     [false  ] split on word rather than on token
  -bo N,     --best-of N         [5      ] number of best candidates to keep
  -bs N,     --beam-size N       [5      ] beam size for beam search
  -wt N,     --word-thold N      [0.01   ] word timestamp probability threshold
  -et N,     --entropy-thold N   [2.40   ] entropy threshold for decoder fail
  -lpt N,    --logprob-thold N   [-1.00  ] log probability threshold for decoder fail
  -debug,    --debug-mode        [false  ] enable debug mode (eg. dump log_mel)
  -tr,       --translate         [false  ] translate from source language to english
  -di,       --diarize           [false  ] stereo audio diarization
  -tdrz,     --tinydiarize       [false  ] enable tinydiarize (requires a tdrz model)
  -nf,       --no-fallback       [false  ] do not use temperature fallback while decoding
  -otxt,     --output-txt        [false  ] output result in a text file
  -ovtt,     --output-vtt        [false  ] output result in a vtt file
  -osrt,     --output-srt        [false  ] output result in a srt file
  -olrc,     --output-lrc        [false  ] output result in a lrc file
  -owts,     --output-words      [false  ] output script for generating karaoke video
  -fp,       --font-path         [/System/Library/Fonts/Supplemental/Courier New Bold.ttf] path to a monospace font for karaoke video
  -ocsv,     --output-csv        [false  ] output result in a CSV file
  -oj,       --output-json       [false  ] output result in a JSON file
  -ojf,      --output-json-full  [false  ] include more information in the JSON file
  -of FNAME, --output-file FNAME [       ] output file path (without file extension)
  -ps,       --print-special     [false  ] print special tokens
  -pc,       --print-colors      [false  ] print colors
  -pp,       --print-progress    [false  ] print progress
  -nt,       --no-timestamps     [false  ] do not print timestamps
  -l LANG,   --language LANG     [en     ] spoken language ('auto' for auto-detect)
  -dl,       --detect-language   [false  ] exit after automatically detecting language
             --prompt PROMPT     [       ] initial prompt
  -m FNAME,  --model FNAME       [models/ggml-base.en.bin] model path
  -f FNAME,  --file FNAME        [       ] input WAV file path
  -oved D,   --ov-e-device DNAME [CPU    ] the OpenVINO device used for encode inference
  -ls,       --log-score         [false  ] log best decoder scores of tokens
  -ng,       --no-gpu            [false  ] disable GPU


bash ./models/download-ggml-model.sh base.en
Downloading ggml model base.en ...
ggml-base.en.bin               100%[========================>] 141.11M  6.34MB/s    in 24s
Done! Model 'base.en' saved in 'models/ggml-base.en.bin'
You can now use it like this:

  $ ./main -m models/ggml-base.en.bin -f samples/jfk.wav


===============================================
Running base.en on all samples in ./samples ...
===============================================

----------------------------------------------
[+] Running base.en on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
----------------------------------------------

whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: kv self size  =    5.25 MB
whisper_model_load: kv cross size =   17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   113.81 ms
whisper_print_timings:      mel time =    15.40 ms
whisper_print_timings:   sample time =    11.58 ms /    27 runs (    0.43 ms per run)
whisper_print_timings:   encode time =   266.60 ms /     1 runs (  266.60 ms per run)
whisper_print_timings:   decode time =    66.11 ms /    27 runs (    2.45 ms per run)
whisper_print_timings:    total time =   476.31 ms

The command downloads the base.en model converted to custom ggml format and runs the inference on all .wav samples in the folder samples.

For detailed usage instructions, run: ./main -h

Note that the main example currently runs only with 16-bit WAV files, so make sure to convert your input before running the tool. For example, you can use ffmpeg like this:

ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
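
Internally, whisper_full() consumes 16 kHz mono float PCM, so the 16-bit samples produced by the ffmpeg command above are converted before inference. A simplified sketch of that conversion (roughly what the examples do; no WAV header parsing, names are illustrative):

#include <cstdint>
#include <vector>

// convert signed 16-bit PCM to normalized float samples in [-1, 1)
std::vector<float> s16_to_f32(const int16_t * samples, size_t n) {
    std::vector<float> pcm(n);
    for (size_t i = 0; i < n; ++i) {
        pcm[i] = float(samples[i]) / 32768.0f;
    }
    return pcm;
}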

More audio samples

If you want some extra audio samples to play with, simply run:

make samples

This will download a few more audio files from Wikipedia and convert them to 16-bit WAV format via ffmpeg.

You can download and run the other models as follows:

make tiny.en
make tiny
make base.en
make base
make small.en
make small
make medium.en
make medium
make large-v1
make large-v2
make large-v3

Memory usage

Model  | Disk    | Mem
tiny   | 75 MiB  | ~273 MB
base   | 142 MiB | ~388 MB
small  | 466 MiB | ~852 MB
medium | 1.5 GiB | ~2.1 GB
large  | 2.9 GiB | ~3.9 GB

Quantization

whisper.cpp supports integer quantization of the Whisper ggml models. Quantized models require less memory and disk space and, depending on the hardware, can be processed more efficiently.

Here are the steps for creating and using a quantized model:

# quantize a model with Q5_0 method
make quantize
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

# run the examples as usual, specifying the quantized model file
./main -m models/ggml-base.en-q5_0.bin ./samples/gb0.wav

Core ML support

On Apple Silicon devices, the Encoder inference can be executed on the Apple Neural Engine (ANE) via Core ML. This can result in a significant speed-up: more than 3x faster compared with CPU-only execution. Here are the instructions for generating a Core ML model and using it with whisper.cpp:

  • Install Python dependencies needed for the creation of the Core ML model:

    pip install ane_transformers
    pip install openai-whisper
    pip install coremltools
    • To ensure coremltools operates correctly, please confirm that Xcode is installed and execute xcode-select --install to install the command-line tools.
    • Python 3.10 is recommended.
    • macOS Sonoma (version 14) or newer is recommended, as older versions of macOS might experience issues with transcription hallucination.
    • [OPTIONAL] It is recommended to utilize a Python version management system, such as Miniconda for this step:
      • To create an environment, use: conda create -n py310-whisper python=3.10 -y
      • To activate the environment, use: conda activate py310-whisper
  • Generate a Core ML model. For example, to generate a base.en model, use:

    ./models/generate-coreml-model.sh base.en

    This will generate the folder models/ggml-base.en-encoder.mlmodelc

  • Build whisper.cpp with Core ML support:

    # using Makefile
    make clean
    WHISPER_COREML=1 make -j
    
    # using CMake
    cmake -B build -DWHISPER_COREML=1
    cmake --build build -j --config Release
  • Run the examples as usual. For example:

    $ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
    
    ...
    
    whisper_init_state: loading Core ML model from 'models/ggml-base.en-encoder.mlmodelc'
    whisper_init_state: first run on a device may take a while ...
    whisper_init_state: Core ML model loaded
    
    system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 |
    
    ...
    

    The first run on a device is slow, since the ANE service compiles the Core ML model to some device-specific format. Next runs are faster.

For more information about the Core ML implementation please refer to PR #566.

OpenVINO support

On platforms that support OpenVINO, the Encoder inference can be executed on OpenVINO-supported devices including x86 CPUs and Intel GPUs (integrated & discrete).

This can result in significant speedup in encoder performance. Here are the instructions for generating the OpenVINO model and using it with whisper.cpp:

  • First, set up a Python virtual environment and install the Python dependencies. Python 3.10 is recommended.

    Windows:

    cd models
    python -m venv openvino_conv_env
    openvino_conv_env\Scripts\activate
    python -m pip install --upgrade pip
    pip install -r requirements-openvino.txt

    Linux and macOS:

    cd models
    python3 -m venv openvino_conv_env
    source openvino_conv_env/bin/activate
    python -m pip install --upgrade pip
    pip install -r requirements-openvino.txt
  • Generate an OpenVINO encoder model. For example, to generate a base.en model, use:

    python convert-whisper-to-openvino.py --model base.en
    

    This will produce ggml-base.en-encoder-openvino.xml/.bin IR model files. It's recommended to relocate these to the same folder as ggml models, as that is the default location that the OpenVINO extension will search at runtime.

  • Build whisper.cpp with OpenVINO support:

    Download the OpenVINO package from the release page. The recommended version to use is 2023.0.0.

    After downloading and extracting the package onto your development system, set up the required environment by sourcing the setupvars script. For example:

    Linux:

    source /path/to/l_openvino_toolkit_ubuntu22_2023.0.0.10926.b4452d56304_x86_64/setupvars.sh

    Windows (cmd):

    C:\Path\To\w_openvino_toolkit_windows_2023.0.0.10926.b4452d56304_x86_64\setupvars.bat

    And then build the project using cmake:

    cmake -B build -DWHISPER_OPENVINO=1
    cmake --build build -j --config Release
  • Run the examples as usual. For example:

    $ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
    
    ...
    
    whisper_ctx_init_openvino_encoder: loading OpenVINO model from 'models/ggml-base.en-encoder-openvino.xml'
    whisper_ctx_init_openvino_encoder: first run on a device may take a while ...
    whisper_openvino_init: path_model = models/ggml-base.en-encoder-openvino.xml, device = GPU, cache_dir = models/ggml-base.en-encoder-openvino-cache
    whisper_ctx_init_openvino_encoder: OpenVINO model loaded
    
    system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 1 |
    
    ...
    

    The first run on an OpenVINO device is slow, since the OpenVINO framework will compile the IR (Intermediate Representation) model to a device-specific 'blob'. This device-specific blob will get cached for the next run.

For more information about the OpenVINO implementation please refer to PR #1037.

NVIDIA GPU support

With NVIDIA cards the processing of the models is done efficiently on the GPU via cuBLAS and custom CUDA kernels. First, make sure you have installed CUDA: https://developer.nvidia.com/cuda-downloads

Now build whisper.cpp with CUDA support:

make clean
WHISPER_CUDA=1 make -j

OpenCL GPU support via CLBlast

For cards and integrated GPUs that support OpenCL, the Encoder processing can be largely offloaded to the GPU through CLBlast. This is especially useful for users with AMD APUs or low-end devices, offering up to ~2x speedup.

First, make sure you have installed CLBlast for your OS or Distribution: https://github.com/CNugteren/CLBlast

Now build whisper.cpp with CLBlast support:

Makefile:
cd whisper.cpp
make clean
WHISPER_CLBLAST=1 make -j

CMake:
cd whisper.cpp
cmake -B build -DWHISPER_CLBLAST=ON
cmake --build build -j --config Release

Run all the examples as usual.

BLAS CPU support via OpenBLAS

Encoder processing can be accelerated on the CPU via OpenBLAS. First, make sure you have installed OpenBLAS: https://www.openblas.net/

Now build whisper.cpp with OpenBLAS support:

make clean
WHISPER_OPENBLAS=1 make -j

BLAS CPU support via Intel MKL

Encoder processing can be accelerated on the CPU via the BLAS compatible interface of Intel's Math Kernel Library. First, make sure you have installed Intel's MKL runtime and development packages: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-download.html

Now build whisper.cpp with Intel MKL BLAS support:

source /opt/intel/oneapi/setvars.sh
mkdir build
cd build
cmake -DWHISPER_MKL=ON ..
WHISPER_MKL=1 make -j

Docker

Prerequisites

  • Docker must be installed and running on your system.
  • Create a folder to store big models and intermediate files (e.g. /whisper/models)

Images

We have two Docker images available for this project:

  1. ghcr.io/ggerganov/whisper.cpp:main: This image includes the main executable file as well as curl and ffmpeg. (platforms: linux/amd64, linux/arm64)
  2. ghcr.io/ggerganov/whisper.cpp:main-cuda: Same as main but compiled with CUDA support. (platforms: linux/amd64)

Usage

# download model and persist it in a local folder
docker run -it --rm \
  -v path/to/models:/models \
  whisper.cpp:main "./models/download-ggml-model.sh base /models"
# transcribe an audio file
docker run -it --rm \
  -v path/to/models:/models \
  -v path/to/audios:/audios \
  whisper.cpp:main "./main -m /models/ggml-base.bin -f /audios/jfk.wav"
# transcribe an audio file in samples folder
docker run -it --rm \
  -v path/to/models:/models \
  whisper.cpp:main "./main -m /models/ggml-base.bin -f ./samples/jfk.wav"

Limitations

  • Inference only

Another example

Here is another example of transcribing a 3:24-minute speech in about half a minute on a MacBook M1 Pro, using the medium.en model:

Expand to see the result
$ ./main -m models/ggml-medium.en.bin -f samples/gb1.wav -t 8

whisper_init_from_file: loading model from 'models/ggml-medium.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 4
whisper_model_load: mem required  = 1720.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     = 1462.35 MB
whisper_model_load: model size    = 1462.12 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

main: processing 'samples/gb1.wav' (3179750 samples, 198.7 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:08.000]   My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:00:08.000 --> 00:00:17.000]   At nine o'clock this morning, Mission Control in Houston lost contact with our Space Shuttle Columbia.
[00:00:17.000 --> 00:00:23.000]   A short time later, debris was seen falling from the skies above Texas.
[00:00:23.000 --> 00:00:29.000]   The Columbia's lost. There are no survivors.
[00:00:29.000 --> 00:00:32.000]   On board was a crew of seven.
[00:00:32.000 --> 00:00:39.000]   Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark,
[00:00:39.000 --> 00:00:48.000]   Captain David Brown, Commander William McCool, Dr. Kultna Shavla, and Ilan Ramon,
[00:00:48.000 --> 00:00:52.000]   a colonel in the Israeli Air Force.
[00:00:52.000 --> 00:00:58.000]   These men and women assumed great risk in the service to all humanity.
[00:00:58.000 --> 00:01:03.000]   In an age when space flight has come to seem almost routine,
[00:01:03.000 --> 00:01:07.000]   it is easy to overlook the dangers of travel by rocket
[00:01:07.000 --> 00:01:12.000]   and the difficulties of navigating the fierce outer atmosphere of the Earth.
[00:01:12.000 --> 00:01:18.000]   These astronauts knew the dangers, and they faced them willingly,
[00:01:18.000 --> 00:01:23.000]   knowing they had a high and noble purpose in life.
[00:01:23.000 --> 00:01:31.000]   Because of their courage and daring and idealism, we will miss them all the more.
[00:01:31.000 --> 00:01:36.000]   All Americans today are thinking as well of the families of these men and women
[00:01:36.000 --> 00:01:40.000]   who have been given this sudden shock and grief.
[00:01:40.000 --> 00:01:45.000]   You're not alone. Our entire nation grieves with you,
[00:01:45.000 --> 00:01:52.000]   and those you love will always have the respect and gratitude of this country.
[00:01:52.000 --> 00:01:56.000]   The cause in which they died will continue.
[00:01:56.000 --> 00:02:04.000]   Mankind is led into the darkness beyond our world by the inspiration of discovery
[00:02:04.000 --> 00:02:11.000]   and the longing to understand. Our journey into space will go on.
[00:02:11.000 --> 00:02:16.000]   In the skies today, we saw destruction and tragedy.
[00:02:16.000 --> 00:02:22.000]   Yet farther than we can see, there is comfort and hope.
[00:02:22.000 --> 00:02:29.000]   In the words of the prophet Isaiah, "Lift your eyes and look to the heavens
[00:02:29.000 --> 00:02:35.000]   who created all these. He who brings out the starry hosts one by one
[00:02:35.000 --> 00:02:39.000]   and calls them each by name."
[00:02:39.000 --> 00:02:46.000]   Because of His great power and mighty strength, not one of them is missing.
[00:02:46.000 --> 00:02:55.000]   The same Creator who names the stars also knows the names of the seven souls we mourn today.
[00:02:55.000 --> 00:03:01.000]   The crew of the shuttle Columbia did not return safely to earth,
[00:03:01.000 --> 00:03:05.000]   yet we can pray that all are safely home.
[00:03:05.000 --> 00:03:13.000]   May God bless the grieving families, and may God continue to bless America.
[00:03:13.000 --> 00:03:19.000]   [Silence]


whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:     load time =   569.03 ms
whisper_print_timings:      mel time =   146.85 ms
whisper_print_timings:   sample time =   238.66 ms /   553 runs (    0.43 ms per run)
whisper_print_timings:   encode time = 18665.10 ms /     9 runs ( 2073.90 ms per run)
whisper_print_timings:   decode time = 13090.93 ms /   549 runs (   23.85 ms per run)
whisper_print_timings:    total time = 32733.52 ms

Real-time audio input example

This is a naive example of performing real-time inference on audio from your microphone. The stream tool samples the audio every half a second and runs the transcription continuously. More info is available in issue #10.

make stream
./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
rt_esl_csgo_2.mp4
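
The idea behind the tool is a simple sliding window: keep the most recent few seconds of audio and re-run whisper_full() after every capture step. A rough sketch is shown below; capture_audio is a hypothetical callback returning newly captured 16 kHz float samples, and the real stream.cpp (which uses SDL2 for capture) is more careful about segment overlap.

#include "whisper.h"

#include <cstdio>
#include <functional>
#include <vector>

void stream_loop(struct whisper_context * ctx, whisper_full_params params,
                 const std::function<std::vector<float>(int ms)> & capture_audio) {
    std::vector<float> window;            // rolling buffer of 16 kHz float samples
    const size_t max_samples = 16000 * 5; // keep ~5 s of context (like --length 5000)

    while (true) {
        // append ~500 ms of new audio (like --step 500)
        const std::vector<float> step = capture_audio(500);
        window.insert(window.end(), step.begin(), step.end());

        // drop the oldest samples once the window is full
        if (window.size() > max_samples) {
            window.erase(window.begin(), window.end() - max_samples);
        }

        // transcribe the current window and print the latest segment
        if (whisper_full(ctx, params, window.data(), (int) window.size()) == 0) {
            const int n = whisper_full_n_segments(ctx);
            if (n > 0) {
                printf("%s\n", whisper_full_get_segment_text(ctx, n - 1));
            }
        }
    }
}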

Confidence color-coding

Adding the --print-colors argument will print the transcribed text using an experimental color coding strategy to highlight words with high or low confidence:

./main -m models/ggml-base.en.bin -f samples/gb0.wav --print-colors

image
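
The per-token probabilities that drive the coloring are also accessible through the C API, so the same idea can be reproduced in your own code. A sketch, assuming these accessor names match the whisper.h of this release:

#include "whisper.h"

#include <cstdio>

// print each token followed by its probability after a successful whisper_full() call;
// low-probability tokens are the ones --print-colors would highlight
void print_with_confidence(struct whisper_context * ctx) {
    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        for (int j = 0; j < whisper_full_n_tokens(ctx, i); ++j) {
            const char * txt = whisper_full_get_token_text(ctx, i, j);
            const float  p   = whisper_full_get_token_p(ctx, i, j);
            printf("%s [p=%.2f] ", txt, p);
        }
        printf("\n");
    }
}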

Controlling the length of the generated text segments (experimental)

For example, to limit the line length to a maximum of 16 characters, simply add -ml 16:

$ ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 16

whisper_model_load: loading model from './models/ggml-base.en.bin'
...
system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:00.850]   And so my
[00:00:00.850 --> 00:00:01.590]   fellow
[00:00:01.590 --> 00:00:04.140]   Americans, ask
[00:00:04.140 --> 00:00:05.660]   not what your
[00:00:05.660 --> 00:00:06.840]   country can do
[00:00:06.840 --> 00:00:08.430]   for you, ask
[00:00:08.430 --> 00:00:09.440]   what you can do
[00:00:09.440 --> 00:00:10.020]   for your
[00:00:10.020 --> 00:00:11.000]   country.

Word-level timestamp (experimental)

The --max-len argument can be used to obtain word-level timestamps. Simply use -ml 1:

$ ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1

whisper_model_load: loading model from './models/ggml-base.en.bin'
...
system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:00.320]
[00:00:00.320 --> 00:00:00.370]   And
[00:00:00.370 --> 00:00:00.690]   so
[00:00:00.690 --> 00:00:00.850]   my
[00:00:00.850 --> 00:00:01.590]   fellow
[00:00:01.590 --> 00:00:02.850]   Americans
[00:00:02.850 --> 00:00:03.300]  ,
[00:00:03.300 --> 00:00:04.140]   ask
[00:00:04.140 --> 00:00:04.990]   not
[00:00:04.990 --> 00:00:05.410]   what
[00:00:05.410 --> 00:00:05.660]   your
[00:00:05.660 --> 00:00:06.260]   country
[00:00:06.260 --> 00:00:06.600]   can
[00:00:06.600 --> 00:00:06.840]   do
[00:00:06.840 --> 00:00:07.010]   for
[00:00:07.010 --> 00:00:08.170]   you
[00:00:08.170 --> 00:00:08.190]  ,
[00:00:08.190 --> 00:00:08.430]   ask
[00:00:08.430 --> 00:00:08.910]   what
[00:00:08.910 --> 00:00:09.040]   you
[00:00:09.040 --> 00:00:09.320]   can
[00:00:09.320 --> 00:00:09.440]   do
[00:00:09.440 --> 00:00:09.760]   for
[00:00:09.760 --> 00:00:10.020]   your
[00:00:10.020 --> 00:00:10.510]   country
[00:00:10.510 --> 00:00:11.000]  .
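
The same behavior can be requested through the C API by enabling token-level timestamps and limiting the segment length. A sketch, assuming the field names match whisper_full_params in this release:

#include "whisper.h"

#include <cstdio>

// configure word-level output (the API equivalent of -ml 1) and print the result
void print_word_timestamps(struct whisper_context * ctx, const float * pcm, int n_samples) {
    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.token_timestamps = true;  // compute per-token timestamps
    params.max_len          = 1;     // at most one word per segment
    // params.split_on_word = true;  // optional: corresponds to -sow

    if (whisper_full(ctx, params, pcm, n_samples) != 0) {
        return;
    }

    // segment timestamps t0/t1 are expressed in units of 10 ms
    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        printf("[%lld --> %lld] %s\n",
               (long long) whisper_full_get_segment_t0(ctx, i),
               (long long) whisper_full_get_segment_t1(ctx, i),
               whisper_full_get_segment_text(ctx, i));
    }
}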

Speaker segmentation via tinydiarize (experimental)

More information about this approach is available here: #1058

Sample usage:

# download a tinydiarize compatible model
./models/download-ggml-model.sh small.en-tdrz

# run as usual, adding the "-tdrz" command-line argument
./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin -tdrz
...
main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, tdrz = 1, timestamps = 1 ...
...
[00:00:00.000 --> 00:00:03.800]   Okay Houston, we've had a problem here. [SPEAKER_TURN]
[00:00:03.800 --> 00:00:06.200]   This is Houston. Say again please. [SPEAKER_TURN]
[00:00:06.200 --> 00:00:08.260]   Uh Houston we've had a problem.
[00:00:08.260 --> 00:00:11.320]   We've had a main beam up on a volt. [SPEAKER_TURN]
[00:00:11.320 --> 00:00:13.820]   Roger main beam interval. [SPEAKER_TURN]
[00:00:13.820 --> 00:00:15.100]   Uh uh [SPEAKER_TURN]
[00:00:15.100 --> 00:00:18.020]   So okay stand, by thirteen we're looking at it. [SPEAKER_TURN]
[00:00:18.020 --> 00:00:25.740]   Okay uh right now uh Houston the uh voltage is uh is looking good um.
[00:00:27.620 --> 00:00:29.940]   And we had a a pretty large bank or so.

Karaoke-style movie generation (experimental)

The main example provides support for output of karaoke-style movies, where the currently pronounced word is highlighted. Use the -owts argument and run the generated bash script. This requires ffmpeg to be installed.

Here are a few "typical" examples:

./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -owts
source ./samples/jfk.wav.wts
ffplay ./samples/jfk.wav.mp4
jfk.wav.mp4

./main -m ./models/ggml-base.en.bin -f ./samples/mm0.wav -owts
source ./samples/mm0.wav.wts
ffplay ./samples/mm0.wav.mp4
mm0.wav.mp4

./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav -owts
source ./samples/gb0.wav.wts
ffplay ./samples/gb0.wav.mp4
gb0.wav.mp4

Video comparison of different models

Use the scripts/bench-wts.sh script to generate a video in the following format:

./scripts/bench-wts.sh samples/jfk.wav
ffplay ./samples/jfk.wav.all.mp4
jfk.wav.all.mp4

Benchmarks

In order to have an objective comparison of the performance of the inference across different system configurations, use the bench tool. The tool simply runs the Encoder part of the model and prints how much time it took to execute it. The results are summarized in the following GitHub issue:

Benchmark results

Additionally, a script for running whisper.cpp with different models and audio files is provided: bench.py.

You can run it with the following command; by default, it will run against any standard model in the models folder.

python3 scripts/bench.py -f samples/jfk.wav -t 2,4,8 -p 1,2

It is written in Python with the intention of being easy to modify and extend for your benchmarking use case.

It outputs a CSV file with the results of the benchmarking.

ggml format

The original models are converted to a custom binary format. This allows packing everything needed into a single file:

  • model parameters
  • mel filters
  • vocabulary
  • weights

You can download the converted models using the models/download-ggml-model.sh script or manually from here:

For more details, see the conversion script models/convert-pt-to-ggml.py or models/README.md.

Examples

There are various examples of using the library for different projects in the examples folder. Some of the examples are even ported to run in the browser using WebAssembly. Check them out!

Example             | Web          | Description
main                | whisper.wasm | Tool for translating and transcribing audio using Whisper
bench               | bench.wasm   | Benchmark the performance of Whisper on your machine
stream              | stream.wasm  | Real-time transcription of raw microphone capture
command             | command.wasm | Basic voice assistant example for receiving voice commands from the mic
wchess              | wchess.wasm  | Voice-controlled chess
talk                | talk.wasm    | Talk with a GPT-2 bot
talk-llama          |              | Talk with a LLaMA bot
whisper.objc        |              | iOS mobile application using whisper.cpp
whisper.swiftui     |              | SwiftUI iOS / macOS application using whisper.cpp
whisper.android     |              | Android mobile application using whisper.cpp
whisper.nvim        |              | Speech-to-text plugin for Neovim
generate-karaoke.sh |              | Helper script to easily generate a karaoke video of raw audio capture
livestream.sh       |              | Livestream audio transcription
yt-wsp.sh           |              | Download + transcribe and/or translate any VOD (original)
server              |              | HTTP transcription server with OAI-like API

If you have any kind of feedback about this project, feel free to use the Discussions section and open a new topic. You can use the Show and tell category to share your own projects that use whisper.cpp. If you have a question, make sure to check the Frequently asked questions (#126) discussion.

whisper.cpp's People

Contributors

0cc4m, abhilash1910, aidanbeltons, asmaloney, bobqianic, boolemancer, cebtenzzre, cherts, didzis, digipom, felrock, finnvoor, fitzsim, ggerganov, ikawrakow, jhen0409, johannesgaessler, josharian, katsu560, marmistrz, nalbion, neozhangjianyu, przemoc, ptsochantaris, sandrohanea, slaren, tamo, ulatekh, xarbirus, zhouwg


whisper.cpp's Issues

inlining failed in call to 'always_inline' 'vfmaq_f16': target specific option mismatch

I am trying to compile for ARM64 and there seems to be an issue with some vector functions:

> [linux/arm64 builder 5/5] RUN gcc -pthread -O3 -march=native -c ggml.c &&     g++ -pthread -O3 -std=c++11 -c main.cpp &&     g++ -pthread -o main ggml.o main.o:
#29 3.977 ggml.c:506:14: note: called from here
#29 3.977   506 |         y1 = vfmaq_f16(y1, x1, v8);
#29 3.977       |              ^~~~~~~~~~~~~~~~~~~~~
#29 3.978 In file included from ggml.c:47:
#29 3.978 /usr/lib/gcc/aarch64-linux-gnu/10/include/arm_neon.h:33208:1: error: inlining failed in call to 'always_inline' 'vfmaq_f16': target specific option mismatch
#29 3.978 33208 | vfmaq_f16 (float16x8_t __a, float16x8_t __b, float16x8_t __c)
#29 3.978       | ^~~~~~~~~
#29 3.978 ggml.c:505:14: note: called from here
#29 3.978   505 |         y0 = vfmaq_f16(y0, x0, v8);
#29 3.978       |              ^~~~~~~~~~~~~~~~~~~~~
------
Dockerfile:11
--------------------
  10 |     ADD whisper.cpp/ /build/
  11 | >>> RUN gcc -pthread -O3 -march=native -c ggml.c && \
  12 | >>>     g++ -pthread -O3 -std=c++11 -c main.cpp && \
  13 | >>>     g++ -pthread -o main ggml.o main.o
  14 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c gcc -pthread -O3 -march=native -c ggml.c &&     g++ -pthread -O3 -std=c++11 -c main.cpp &&     g++ -pthread -o main ggml.o main.o" did not complete successfully: exit code: 1

Tested on GitHub actions (logs) and on a Raspberry Pi 4.

Dockerfile:

# build image
FROM debian:bullseye-slim AS builder
WORKDIR /build/
RUN apt-get update && apt-get install --no-install-recommends -y \
    make gcc g++ wget \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*

# Install Whisper.cpp
ADD whisper.cpp/ /build/
RUN gcc -pthread -O3 -march=native -c ggml.c && \
    g++ -pthread -O3 -std=c++11 -c main.cpp && \
    g++ -pthread -o main ggml.o main.o

ggml_graph_compute: Assertion `false' failed

Examples work fine for me, but I get an error when trying different wavefiles larger than 21s:

./main -m models/ggml-base.en.bin mycut-22s-16khz.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
[...]
main: processing 'mycut-22s-16khz.wav' (352000 samples, 22.0 sec), 2 threads, lang = en, task = transcribe, timestamps = 1 ...
[...]

main: ggml.c:6658: ggml_graph_compute: Assertion `false' failed.
Aborted (core dumped)

It's apparently fixed if I comment out the assert and replace it with cgraph->work = NULL; in ggml.c. But I guess it's not the best workaround, as it crashes again with a segfault if the audio duration is more than approximately 43 s.

Running on Ubuntu 18.04 LTS (GNU/Linux 4.15.0-187-generic x86_64), after fixing CACHE_LINE_SIZE and initializers issues #11

Any hint? Thanks!

Not working on MacOS (ARM)

Hi, I've been trying to get this to work a few times, but it always fails with an illegal hardware instruction error.

E.g. for ./main -m models/ggml-small.bin -f samples/jfk.wav I get the following output:

whisper_model_load: loading model from 'models/ggml-small.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1048.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 533.05 MB
fish: Job 1, './main -m models/ggml-small.b...' terminated by signal SIGILL (Illegal instruction)

I've tried other models as well, but the result is always the same.

Language selection

I'm glad you shared this implementation.
A steep increase in performance relative to the torch on the CPU.

You may already know this, but I found out how to enable recognition of a specific language.
We can just put this at line 2012 of main.cpp:

std::vector<whisper_vocab::id> prompt = { vocab.token_sot, vocab.token_lang, vocab.token_task };  

These 3 tokens are formed here:
https://github.com/openai/whisper/blob/8cf36f3508c9acd341a45eb2364239a3d81458b9/whisper/tokenizer.py#L324-L331

For specific use in main.cpp, you can simply specify the desired index manually. But for regular users, it would be cool to be able to specify which language they would prefer to see in the output.
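
For reference, current versions expose this directly through whisper_full_params and the -l / --language flag of main, so no manual token handling is needed. A sketch:

#include "whisper.h"

// select the spoken language through the parameters ("auto" enables auto-detection)
whisper_full_params make_params(const char * lang /* e.g. "de" */) {
    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    if (whisper_lang_id(lang) != -1) {   // whisper_lang_id() returns -1 for unknown codes
        params.language = lang;
    }
    return params;
}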

This code but with CUDA

Does anyone have any ideas on how to use this code with CUDA libs? I want to move away from the Python version but keep PyTorch CUDA.

Comparison with torch jit

Great work! I find the implementation of ggml especially interesting. It looks like you implement all the basic neural network building blocks with ggml. How does it compare with the torch JIT approach of using a PyTorch model in C++?

Cheaper hardware to run bigger model

Referring to our discussion at #8: I can run ggml-large.bin on the same 120-second (2-minute) input audio in around 54 minutes on a Samsung A52.

What are your suggestions for optimizations to run a bigger model on cheaper hardware:

  1. Selecting better hardware for array manipulation (Neon?)
  2. Improve algorithm
  3. Use GPU provided by hardware
  4. ... ?

I will be happy if you share resources I can learn from to achieve that goal.

Android example app

Implement a very basic Java application using whisper.cpp. It can be used as an example for running Whisper on Android.

The ggwave-java project can be used as a good starting point. It already provides the audio capture functionality. Instead of passing it to ggwave, we just need to pass it to whisper.cpp.

Edit:
Looking for volunteers to help with this - ideally, we would like to have the same functionality demonstrated as in the iOS example application.

Thread 1 "main" received signal SIGILL, Illegal instruction.

Seemed to come from here:

0x00005555555586fb in _mm256_fmadd_ps (__C=..., __B=..., __A=...) at /usr/lib/gcc/x86_64-linux-gnu/7/include/fmaintrin.h:65
65	  return (__m256)__builtin_ia32_vfmaddps256 ((__v8sf)__A, (__v8sf)__B,

with backtrace

(gdb) bt
#0  0x00005555555586fb in _mm256_fmadd_ps (__C=..., __B=..., __A=...) at /usr/lib/gcc/x86_64-linux-gnu/7/include/fmaintrin.h:65
#1  ggml_vec_dot_f16 (n=96, s=0x7ffffffe4e54, x=0x7fff646b6ee0, y=0x7fff64746ee0) at ggml.c:375
#2  0x0000555555564766 in ggml_compute_forward_conv_1d_1s_f16_f32 (params=0x7ffffffe51c0, src0=0x7fff9025f0f0, src1=0x7fff65482030, dst=0x7fff6556c6f0) at ggml.c:4668
#3  0x0000555555564f40 in ggml_compute_forward_conv_1d_1s (params=0x7ffffffe51c0, src0=0x7fff9025f0f0, src1=0x7fff65482030, dst=0x7fff6556c6f0) at ggml.c:4806
#4  0x0000555555568707 in ggml_compute_forward (params=0x7ffffffe51c0, tensor=0x7fff6556c6f0) at ggml.c:5809
#5  0x000055555556a6ec in ggml_graph_compute (ctx=0x5555557f3b48 <g_state+104>, cgraph=0x7ffffffe5340) at ggml.c:6611
#6  0x0000555555580cb2 in whisper_encode (model=..., n_threads=4, mel_offset=0, mel_inp=..., features=std::vector of length 0, capacity 0) at main.cpp:1353
#7  0x0000555555584664 in main (argc=5, argv=0x7fffffffdb78) at main.cpp:2225

On Ubuntu 18.04, gcc 7.5.0, on an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz

C API threadsafety

I can't see any docs regarding thread safety for the C API. Information here would be very helpful for me and future users. Thanks!

Support for realtime audio input

Noting that the processing time is considerably shorter than the length of speech, is it possible to feed the models real-time microphone output? Or does the inference run on the complete audio stream, instead of sample by sample?

This would greatly reduce the latency for voice assistants and the like, that the audio does not need to be fully captured and only after that fed to the models. Basically the same as I did here with SODA: https://github.com/biemster/gasr, but then with an open source and multilang model.

Tutorial on implementation of ggml?

Hi Georgi, I am sure this is not the right platform to make an unreasonable request. Could you make a tutorial or docs on how you went about implementing ggml, and especially the design?
I am personally lacking this skill.

Thank you

make stream fails due to missing dependency

If you normally try to build stream with make stream, it will fail with:

g++ -O3 -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread stream.cpp ggml.o whisper.o -o stream `sdl2-config --cflags --libs`
/bin/sh: 1: sdl2-config: not found
stream.cpp:12:10: fatal error: SDL.h: No such file or directory
   12 | #include <SDL.h>
      |          ^~~~~~~
compilation terminated.
make: *** [Makefile:76: stream] Error 1

The missing dependency for this is https://www.libsdl.org/ and can be installed with:

sudo apt-get install libsdl2-dev

Would be nice to add this to the README; I might do this later if I have time.

CMake builds run MUCH slower than Make builds

Hi, and thanks so much for this project. It's really, really fast. I've been compiling on M1 Mac, Intel Mac, and Windows, and I've noticed something across the board: CMake builds run much, much slower (3-4x) than Make builds. I would love to put some time into fixing this and submitting a PR, but I'm really busy right now.

I may have time in a couple of weeks to contribute but just wanted to put this on your radar in case there's some obvious easy fix.

Hosting the ggml models in the cloud

Currently, I am hosting the ggml Whisper model files on my Linode server.
However, it has a limited network bandwidth per month and as more people start using whisper.cpp it won't be enough.

What are some good options for hosting ~10GB of data?

The only requirement is to be able to wget/curl the files directly - i.e. Google Drive and alike are not an option.

SIGFPE on certain audio files

Hey there! I'm testing out whisper.cpp to see if it would be suitable for production use. However I'm running into a SIGFPE on certain audio files: namely those that do not produce any output from the model. Because of the way my system is set up, I'm unable to provide any test files that can reproduce this bug.

However, I was able to build the library with debug symbols and trigger the exception. It seems to be a divide-by-zero error on line 2349 of whisper.cpp:

int progress_cur = (100*seek)/whisper_n_len(ctx);

The GDB output is as follows:

Thread 21 "scripty_stt_ser" received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7ffff7085700 (LWP 3869)]
0x0000555555599123 in whisper_full (ctx=0x5555556f6a80, params=..., samples=<optimized out>, n_samples=<optimized out>) at whisper.cpp:2349
2349            int progress_cur = (100*seek)/whisper_n_len(ctx);

Unfortunately, despite compiling with debug symbols (-g flag), bt gave no extra info beyond that:

(gdb) bt
#0  0x0000555555599123 in whisper_full (ctx=0x5555556f6a80, params=..., samples=<optimized out>, n_samples=<optimized out>) at whisper.cpp:2349
#1  0x0000555555593cf6 in whisper_rs::whisper_ctx::WhisperContext::full (self=<optimized out>, params=..., data=...) at src/whisper_ctx.rs:390

Let me know if there's anything else I can do to help!
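
For illustration only (not necessarily the project's actual fix), a guard like the following sketch would avoid the division by zero when the context reports a zero audio length:

// hypothetical guarded version of the progress computation inside whisper_full()
const int n_len = whisper_n_len(ctx);
const int progress_cur = n_len > 0 ? (100*seek)/n_len : 100;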

Correct parameters for cross-compiling for ARM Android?

What are the correct parameters for cross-compiling for ARM Android? I'm using Intel Ubuntu, android-ndk-r25b.


ggml.c:232:16: warning: implicit declaration of function 'vfmaq_f32' is invalid in C99 [-Wimplicit-function-declaration]
        sum0 = vfmaq_f32(sum0, x0, y0);
               ^
ggml.c:232:14: error: assigning to 'float32x4_t' (vector of 4 'float32_t' values) from **incompatible type** 'int'
        sum0 = vfmaq_f32(sum0, x0, y0);
             ^ ~~~~~~~~~~~~~~~~~~~~~~~

./ggml.c:331:14: error: assigning to 'float16x8_t' (vector of 8 'float16_t' values) from **incompatible type** 'int'
        sum0 = vfmaq_f16(sum0, x0, y0);
             ^ ~~~~~~~~~~~~~~~~~~~~~~~             

PyTorch performance for Linear layer is 4 times faster than matmul

I am doing some performance optimizations in ggml, and it seems that PyTorch's Linear layer currently outperforms my implementation by a factor of ~4 for big matrices. I am wondering what the secret is there and if someone can give me some tips on how to achieve this performance.


Consider the following line from the original whisper implementation:

https://github.com/openai/whisper/blob/e90b8fa7e845ae184ed9aa0babcf3cde6f16719e/whisper/model.py#L73

This is effectively equivalent to a matrix multiplication of x with a square weights matrix from the model (encoder.blocks.0.attn.query.weight) and sum with a bias vector (encoder.blocks.0.attn.query.bias).

I compared the runtime for this line with an explicit matrix multiplication of same size matrices.
To do that, I replaced the line with this piece of code:

# original
        q = self.query(x)

# modified
        start = time.time()
        q = self.query(x)
        print('time for self.query(x) = ', time.time() - start)

        start = time.time()
        r0 = torch.rand(x.shape[1], x.shape[2], dtype=torch.float32)
        r1 = torch.rand(x.shape[2], x.shape[2], dtype=torch.float32)
        r2 = r0 @ r1
        print('time for r2 (mat_mul)  = ', time.time() - start)

        print(self.query)
        print(' x shape = ',  x.shape, ' dtype = ',  x.dtype)
        print('r0 shape = ', r0.shape, ' dtype = ', r0.dtype)
        print('r1 shape = ', r1.shape, ' dtype = ', r1.dtype)
        print('r2 shape = ', r2.shape, ' dtype = ', r2.dtype)

I would have expected the time for self.query(x) to be equal to the time for r2 (mat_mul).
However, here is the result on my MacBook when running the large model:

time for self.query(x) =  0.0034177303314208984
time for r2 (mat_mul)  =  0.012507200241088867
Linear(in_features=1280, out_features=1280, bias=True)
 x shape =  torch.Size([1, 1500, 1280])  dtype =  torch.float32
r0 shape =  torch.Size([1500, 1280])  dtype =  torch.float32
r1 shape =  torch.Size([1280, 1280])  dtype =  torch.float32
r2 shape =  torch.Size([1500, 1280])  dtype =  torch.float32

So the Linear layer is almost 4 times faster (3.4 ms vs 12.5 ms) compared to explicit matrix multiplication.


How do we explain this difference?

Is PyTorch using some int8 quantisation technique under the hood to speed up this layer? If so, how can I verify that this is the case?

Any insight will be very much appreciated!

ggml.c CACHE_LINE_SIZE error: initializer element is not constant

I get this on Ubuntu 18.04 gcc 7.5.0 (time to update, yes), and I don't immediately see how to fix it since I don't know __cpp_lib_hardware_interference_size. Otherwise a simple replacement with a #define would suffice.

gcc -pthread -O3 -mavx -mavx2 -mfma -mf16c -c ggml.c
ggml.c:183:36: error: initializer element is not constant
 const size_t CACHE_LINE_SIZE_F32 = CACHE_LINE_SIZE/sizeof(float);

Python bindings (C-style API)

Good day everyone!
I'm thinking about bindings for Python.

So far, I'm interested in 4 functionalities:

  1. Encoder processing
  2. Decoder processing
  3. Transcription of audio (feed audio bytes, get text)
  4. 3 + timestamps of all words (feed audio bytes, get text + the timestamp of each word). Of course, it's too early to think about word timestamps, since even in the Python implementation they are still not well done.

Perhaps in the near future, I will try to take up this task. But I have no experience with Python bindings. So, if there are craftsmen who can do it quickly (if it can be done quickly... 😃), that would be cool!

Running inference over a large batch of audio files

Hi! Firstly, thank you so much for this incredible work!

I have been running the tiny.en models on a large number of wav files stored in a folder. I am currently parallelizing the work over a multi-core machine using GNU parallel and running the following command :

find input_data/eng_wav_data -name "*.wav" | parallel 'time ./main -m models/ggml-tiny.en.bin -nt -f {} -t 1 > {.}.txt'

I found that currently the model is loaded each time we have to transcribe a wav file. Is there a way I can circumvent this and load the model only once? Any help would be appreciated. Thank you. Apologies if this issue has been resolved already.
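
When linking against libwhisper directly instead of spawning ./main per file, the model can indeed be loaded once and reused. A sketch (load_wav_f32 is a hypothetical helper that returns 16 kHz mono float samples, e.g. implemented with dr_wav as the examples do):

#include "whisper.h"

#include <cstdio>
#include <string>
#include <vector>

// hypothetical helper: read a 16-bit WAV file and return 16 kHz mono float samples
std::vector<float> load_wav_f32(const std::string & path);

void transcribe_batch(const char * model_path, const std::vector<std::string> & files) {
    // load the model once for the whole batch
    struct whisper_context * ctx =
        whisper_init_from_file_with_params(model_path, whisper_context_default_params());
    if (!ctx) return;

    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    for (const auto & f : files) {
        const std::vector<float> pcm = load_wav_f32(f); // hypothetical helper
        if (whisper_full(ctx, params, pcm.data(), (int) pcm.size()) == 0) {
            for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
                printf("%s\n", whisper_full_get_segment_text(ctx, i));
            }
        }
    }

    whisper_free(ctx); // model weights stay in memory until here
}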

transcription time 2.7x the wav file duration

Thanks for sharing whisper.cpp @ggerganov. Wondering if I'm missing something. I tried whisper.cpp on a 40-minute wav file, which took almost 2 hours to transcribe; that doesn't seem to be what others have experienced. I transcribed on an 8-vCPU machine with 32 GB of memory. Are there any settings I'm missing? Appreciate your help.

Unfortunately I'm unable to share the wav file as it's private data.

whisper_model_load: loading model from 'models/ggml-large.bin'
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 5
whisper_model_load: mem_required = 4576.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 3255.34 MB
whisper_model_load: memory size = 304.38 MB
whisper_model_load: model size = 2950.66 MB

main: processing 'output/x.wav' (38688821 samples, 2418.1 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...
whisper_print_timings: load time = 4246.85 ms
whisper_print_timings: mel time = 31377.23 ms
whisper_print_timings: sample time = 3421.71 ms
whisper_print_timings: encode time = 4697475.00 ms / 146796.09 ms per layer
whisper_print_timings: decode time = 1830579.38 ms / 57205.61 ms per layer
whisper_print_timings: total time = 6568016.00 ms

Need -pthread for make on ubuntu 20.04

Makefile

main: ggml.o main.o
	g++ -pthread -o main ggml.o main.o
	./main -h

ggml.o: ggml.c ggml.h
	gcc -pthread -O3 -mavx -mavx2 -mfma -mf16c -c ggml.c

main.o: main.cpp ggml.h
	g++ -pthread -O3 -std=c++11 -c main.cpp

Windows build

Would be nice if someone can help and provide build instructions for Windows.

I think the only thing that might need an update is the pthread dependency in ggml.c.
The rest of the code should build successfully.

Probably a .bat script to download the models would also be nice, since there is no Bash on Windows.

WASM port

We can easily build whisper.cpp as a WASM library using Emscripten:

mkdir build-em
cd build-em
emcmake cmake ..
make

It looks like a big subset of SIMD intrinsics are already supported, so the performance might not be really bad:

https://emscripten.org/docs/porting/simd.html

So let's try running whisper.cpp directly in the browser!

  • The model file could either be fetched on load, or the user can drag and drop it in the browser window
  • We need a simple page that records a short audio at 16 kHz sampling rate and passes it to the WASM module for transcription. Probably something similar to this ggwave example can be used

Timestamps for words instead of sentence possible?

Do you think that could be possible in some way?

I would like to get the timestamp of each word instead of the sentence (word bundle).
That could be useful for some kind of karaoke lyrics generator,
or for text "lip sync" in a video clip or 3D character synchronization.

Cheers

whisper : mark speakers/voices (diarization)

Hi,

I'm not very familiar with the details of whisper or whisper.cpp, and I don't know if it is currently even possible with the foundation, but it would be nice if speakers, or speaker/voice changes, could be marked.

This would be very handy when processing interviews, radio/tv shows, films, etc.

Kind regards,
abelbabel

[Feature] recognize data coming via pipe stream

Hi,

it would be great to have a simple app that takes data from a pipe and runs recognition on it... similar to stream.cpp, but taking the data from a pipe instead of from the audio device...

It could also be an addition to the main example, so that you can use it like this:

cat samples/jfk.wav | ./main -m models/ggml-medium.bin -f -

Here, something similar is done with Vosk and Python. (ffmpeg pre-processing could be something people do on their own before filling the pipe, rather than part of the app...)

Kind regards,
abelbabel
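
A minimal sketch of what such a tool could look like: it reads raw signed 16-bit mono 16 kHz PCM from stdin (i.e. the caller does the ffmpeg pre-processing, as suggested above). This is illustrative only, not an existing example.

#include "whisper.h"

#include <cstdint>
#include <cstdio>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <ggml-model.bin> < raw_s16le_16kHz_mono.pcm\n", argv[0]);
        return 1;
    }

    struct whisper_context * ctx =
        whisper_init_from_file_with_params(argv[1], whisper_context_default_params());
    if (!ctx) return 1;

    // read the whole stream from stdin and convert to float PCM
    std::vector<float> pcm;
    int16_t buf[4096];
    size_t n;
    while ((n = fread(buf, sizeof(int16_t), 4096, stdin)) > 0) {
        for (size_t i = 0; i < n; ++i) {
            pcm.push_back(buf[i] / 32768.0f);
        }
    }

    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    if (whisper_full(ctx, params, pcm.data(), (int) pcm.size()) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }
    }

    whisper_free(ctx);
    return 0;
}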

Unicode/Encoding Issue with Japanese Text

I'm trying to run Japanese audio files through whisper.cpp, and it is returning some "corrupted" output.

Here is the output from whisper and whisper.cpp for comparison:

Command: whisper output.wav --model large --language Japanese
Output:  さくらちゃん**神経もすっごくいいし、バトンもうまいんだけど

Command: ./main -m models/ggml-large.bin -l ja -f output.wav
Output:  さくらちゃん**神��もすっごくいいし、バトンもうまいんだけど。

The expected 「神経も」 portion is the following in hex:

0xE7A59E 0xE7B58C 0xE38282

The "corrupted" 「神��も」 portion is:

0xE7A59E 0xEFBFBD 0xEEBFBD 0xE38282


Note: I had to comment out a few lines from whisper.cpp around line 2300 for "make" to compile. I do not know if this would impact it.

                    .beam_search = {
                        //.n_past = 0,
                        //.beam_width = 10,
                        //.n_best = 5,
                    },

Output file

Hello there. Seems like redirecting the standard output with either >, >> or tee doesn't work. Would be nice to have an option to save the output to a specific file.

./whisper.cpp/whisper.h:121:81: error: unknown type name 'bool'

I'm attempting to automate rust-bindgen generation. This appears to not work, however, as it uses clang which does not implicitly #include <stdbool.h>. Adding #include <stdbool.h> to line 5 of whisper.h appears to fix this. I'm opening this issue to get feedback and others' thoughts.

Error: "whisper_full: failed to generate timestamp token - this should not happen"

I was running a task on a German-language YouTube video with the command line
./main -m ggml-base.bin bauer.wav -t 8 -l de -osrt
and the process ran ok until around the 4-minute mark, then I've got the error:

"whisper_full: failed to generate timestamp token - this should not happen"

repeated several times, and the transcription never resumed.
I changed the command line to use 4 cores and didn't include the SRT file generation, and still got the same error.
Curiously, if I force English transcription with "-l en", the transcription is OK until 4 minutes or so, and then the same sentence repeats until the end of the file.

I think this happened after the commit to reduce the sentence length.

How do I compile to a shared library? without libc++_shared.so ?

I want to experiment with using whisper in my app, but when I open it, an error occurs because the compiled library requires libc++_shared.so.

I use this bash script to build for the Android target:

/home/azkdev/Android/Sdk/ndk/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android21-clang -pthread -O3 -std=c11 -mavx -mavx2 -mfma -mf16c -c ./ggml.c -fPIC -lstdc++
/home/azkadev/Android/Sdk/ndk/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android21-clang++ -pthread -O3 -std=c++11 -mavx -mavx2 -mfma -mf16c -c ./whisper.cpp -fPIC -lstdc++
/home/azkadev/Android/Sdk/ndk/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android21-clang++ -pthread -O3 -std=c++11 ./main.cpp -fPIC -lstdc++ whisper.o ggml.o -o ./whisper.so --shared -fPIC -lstdc++

I have also tried this clang-linking-so-library-libc-shared-so, but it doesn't work

Error
Screenshot from 2022-10-08 19-17-21

Can you give a build command so it doesn't need libc++_shared.so? Sorry, I'm still a beginner in C++.

Feature request

Hi @ggerganov
whisper.cpp looks promising, thank you for your work.
I know there is a timestamp limitation in the README currently.
Is it possible to include timestamps in the future? That would be useful when generating subtitles.
Or could whisper.cpp support a stream mode with streaming audio?

Make fails

g++ -O3 -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread -c whisper.cpp
whisper.cpp: In function ‘whisper_full_params whisper_full_default_params(whisper_decode_strategy)’:
whisper.cpp:2286:17: sorry, unimplemented: non-trivial designated initializers not supported
 };
 ^
whisper.cpp:2313:17: sorry, unimplemented: non-trivial designated initializers not supported
 };
 ^
Makefile:74: recipe for target 'whisper.o' failed
make: *** [whisper.o] Error 1

g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
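
For context, whisper_full_default_params builds its return value with nested ("non-trivial") designated initializers, which the GCC 7 C++ front end does not implement; newer GCC releases accept them as an extension. A rough illustration of the pattern and of a workaround that older compilers accept (hypothetical struct, not the real whisper_full_params):

struct params {
    int n_threads;
    struct {
        int beam_width;
    } beam_search;
};

// Roughly what the failing code looks like: a nested designated initializer,
// which g++ 7 rejects with "sorry, unimplemented".
params make_params_designated() {
    return {
        .n_threads   = 4,
        .beam_search = { .beam_width = 10 },
    };
}

// Equivalent that older compilers accept: zero-initialize, then assign.
params make_params_assigned() {
    params p = {};
    p.n_threads              = 4;
    p.beam_search.beam_width = 10;
    return p;
}

The upstream fix may look different; this only illustrates why g++ 7.5 in particular chokes on that function.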

Performance Xeon

Performance report.
Meaning of V2 and V3: V2 is the code from before this commit, V3 from after it.

  • CPU: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
  • Task: 200 s of audio (7 different files of varying quality)

V2 -t

model    T (s)   -t (threads)
tiny        64    1
tiny        21    4
tiny        21    8
tiny        80   16
tiny       175   24
base        42    8
base        93   16
small      110    8
small      190   16
large      420    8
large      537   16

V3 -t

model    T (s)   -t (threads)
tiny        84    1
tiny        32    4
tiny        28    8
tiny        56   16
tiny        86   24
base        58    8
base       125   16
small      104    8
small      177   16
large      570    8
large      850   16

V2 parallel

  • Use parallel bash computations
  • 7 parallel jobs, in each job -t specified
model    T (s)   -t (threads per job)
tiny        17    1
tiny         9    2
tiny         5    4
base        56    1
base        25    2
base        16    4
small      155    1
small       86    2
small       53    4
large      788    1
large      428    2
large      260    4

Encode vs Decode time (V2 vs V3) tiny

V2

  • File 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 452.00 MB
main:     load time =    84.28 ms
main:      mel time =   118.88 ms
main:   sample time =    46.91 ms
main:   encode time =   531.27 ms / 132.82 ms per layer
main:   decode time =  3730.47 ms
main:    total time =  6181.17 ms
  • File 2
main:     load time =    80.49 ms
main:      mel time =    97.64 ms
main:   sample time =    13.85 ms
main:   encode time =   533.10 ms / 133.27 ms per layer
main:   decode time =  1036.91 ms
main:    total time =  2348.79 ms

V3

  • File 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 244.00 MB
main:     load time =   241.68 ms
main:      mel time =   656.11 ms
main:   sample time =  1202.84 ms
main:   encode time =  1736.55 ms / 434.14 ms per layer
main:   decode time =  8354.48 ms
main:    total time = 12211.61 ms
  • File 2
main:     load time =   243.57 ms
main:      mel time =   541.42 ms
main:   sample time =   209.42 ms
main:   encode time =  2901.70 ms / 725.42 ms per layer
main:   decode time =  1588.76 ms
main:    total time =  5501.20 ms

/whisper.cpp/whisper.cpp:2305:17: internal compiler error: in reshape_init_class, at cp/decl.c:6465

Fully stumped. I only ran make. Toolchain: cpp (Ubuntu 11.2.0-19ubuntu1) 11.2.0

whisper.cpp: In function ‘whisper_full_params whisper_full_default_params(whisper_decode_strategy)’:
whisper.cpp:2305:17: internal compiler error: in reshape_init_class, at cp/decl.c:6465
 2305 |                 };
      |                 ^
0x7f415aaa6d8f __libc_start_call_main
        ../sysdeps/nptl/libc_start_call_main.h:58
0x7f415aaa6e3f __libc_start_main_impl
        ../csu/libc-start.c:392
Please submit a full bug report, with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <file:///usr/share/doc/gcc-11/README.Bugs> for instructions.

MUSL Linux builds

Hi there! I'm attempting to build whisper.cpp for MUSL Linux for some lightweight systems, and I figured I would note the issues I ran into during the build.

  1. Alpine appears not to include stdint.h or alloca.h in its standard library when only gcc is installed. This results in a slew of errors:
localhost:~/whisper.cpp# make libwhisper.a
cc  -O3 -std=c11   -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread   -c ggml.c
In file included from ggml.h:7,
                 from ggml.c:1:
/usr/lib/gcc/aarch64-alpine-linux-musl/12.2.1/include/stdint.h:9:26: error: no include path in which to search for stdint.h
    9 | # include_next <stdint.h>
      |                          ^
ggml.h:107:5: error: unknown type name 'int64_t'
  107 |     int64_t perf_cycles;
      |     ^~~~~~~
~~snip~~

ggml.c:6:10: fatal error: alloca.h: No such file or directory
    6 | #include <alloca.h>
      |          ^~~~~~~~~~
compilation terminated.
make: *** [Makefile:58: ggml.o] Error 1
localhost:~/whisper.cpp#

The fix is relatively simple: just install g++:

apk add g++
  2. clock_gettime and CLOCK_MONOTONIC are seemingly undefined regardless of the compiler used:
localhost:~/whisper.cpp# make libwhisper.a
cc  -O3 -std=c11   -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread   -c ggml.c
ggml.c: In function 'ggml_time_ms':
ggml.c:155:5: warning: implicit declaration of function 'clock_gettime' [-Wimplicit-function-declaration]
  155 |     clock_gettime(CLOCK_MONOTONIC, &ts);
      |     ^~~~~~~~~~~~~
ggml.c:155:19: error: 'CLOCK_MONOTONIC' undeclared (first use in this function)
  155 |     clock_gettime(CLOCK_MONOTONIC, &ts);
      |                   ^~~~~~~~~~~~~~~
ggml.c:155:19: note: each undeclared identifier is reported only once for each function it appears in
ggml.c: In function 'ggml_time_us':
ggml.c:161:19: error: 'CLOCK_MONOTONIC' undeclared (first use in this function)
  161 |     clock_gettime(CLOCK_MONOTONIC, &ts);
      |                   ^~~~~~~~~~~~~~~
make: *** [Makefile:58: ggml.o] Error 1
localhost:~/whisper.cpp# 

Digging around the internet turned up a fix: insert #define _POSIX_C_SOURCE 199309L before the time.h header is included. Placing it on line 10 of ggml.c appears to work. It would be nice if this issue could be fixed upstream in some way. I would make a PR if I had sufficient knowledge to implement the required changes, which I don't.
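
A sketch of that workaround, assuming it is placed before any system header is pulled in at the top of ggml.c:

// ggml.c: with -std=c11, musl's <time.h> only exposes clock_gettime()
// and CLOCK_MONOTONIC when a POSIX feature-test macro is defined first.
#define _POSIX_C_SOURCE 199309L

#include <time.h>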

Compile error: internal compiler error

I have this issue when trying to compile the most recent version (as of 16 Oct 2022):

(base) user@pc:~/whisper.cpp$ make
g++ -O3 -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread -c whisper.cpp
whisper.cpp: In function ‘whisper_full_params whisper_full_default_params(whisper_decode_strategy)’:
whisper.cpp:2305:17: internal compiler error: in reshape_init_class, at cp/decl.c:6465
2305 | };
| ^
0x7fdf6ca75d8f __libc_start_call_main
../sysdeps/nptl/libc_start_call_main.h:58
0x7fdf6ca75e3f __libc_start_main_impl
../csu/libc-start.c:392
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See file:///usr/share/doc/gcc-11/README.Bugs for instructions.
make: *** [Makefile:61: whisper.o] Error 1

Just to be sure it wasn't my setup, I compiled the fork I have at https://github.com/Topping1/whisper.cpp (2 commits ahead, 8 commits behind) and it compiled fine. For further verification I ran a diff between the two whisper.cpp files and found this
(left is whisper.cpp from my repository, right is the updated one as of 16-Oct-2022):

(diff screenshot omitted)

Do you know what might be causing the issue?

Add function to clear model words

Hi there! I'm trying to save compute resources by reusing WhisperContext objects in an STT server instance, but if no words are detected in the audio, whatever words were found in the last transcription that did detect words get emitted again. This is a major issue, and I'd like a way to prevent it. The easiest way I can think of is adding a function to clear the words stored in the model. I considered working around this in my app, but I realized that storing many sentences there could cause serious overhead and introduce user-privacy risks, compared to just clearing the words from the model itself. Thanks!
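
For illustration, the kind of helper being requested might look like this (hypothetical declaration; it does not exist in the current API):

// Hypothetical addition to whisper.h: discard the text context kept from
// previous whisper_full() calls, so that a silent recording cannot cause
// the previous transcription to be emitted again.
WHISPER_API void whisper_reset_text_context(struct whisper_context * ctx);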

iOS example app

Implement a very basic iOS application using whisper.cpp

The ggwave-objc project can be used as a good starting point. It already provides the audio capture functionality. We just need to pass the captured data to whisper.cpp.

Build on FreeBSD

Hi,

I was able to compile this on FreeBSD 13.1-RELEASE-p2 amd64 with devel/gmake installed (using gmake instead of make) and the following modifications:

--- Makefile_ori        2022-10-16 21:19:22.498824000 +0200
+++ Makefile    2022-10-16 22:40:53.787014000 +0200
@@ -22,10 +22,17 @@
        CFLAGS   += -pthread
        CXXFLAGS += -pthread
 endif
+ifeq ($(UNAME_S),FreeBSD)
+       CFLAGS   += -pthread
+       CXXFLAGS += -pthread
+endif
 
 # Architecture specific
 # TODO: probably these flags need to be tweaked on some architectures
 ifeq ($(UNAME_M),x86_64)
+       CFLAGS += -mavx -mavx2 -mfma -mf16c
+endif
+ifeq ($(UNAME_M),amd64)
        CFLAGS += -mavx -mavx2 -mfma -mf16c
 endif
 ifneq ($(filter arm%,$(UNAME_M)),)

(I don't know gmake Makefiles very well; this could probably be prettier with a logical OR here ...)

--- ggml.c_ori  2022-10-16 21:19:22.502786000 +0200
+++ ggml.c      2022-10-16 21:28:00.140594000 +0200
@@ -2,7 +2,7 @@

 #if defined(_MSC_VER) || defined(__MINGW32__)
 #include <malloc.h> // using malloc.h with MSC/MINGW
-#else
+#elif !defined(__FreeBSD__)
 #include <alloca.h>
 #endif

It doesn't seem too hard to merge these changes upstream ...

For downloading the models, ftp or wget is needed.

Kind regards,
abelbabel

Token decoding issue - some characters are missing

./main -m models/ggml-medium.bin -l zh -f ~/Movies/samplecn16k.wav

whisper.cpp output:

[00:00.000 --> 00:16.000]  元����,其实就是����世界,而且要用����世界这个��来定��元����的话,要比元����本身更加����。到这里就出现问题了。那它为什么不叫����世界呢?最��单的原因就是,����世界这个说法大家已经听��了,而元������得更为新��,又包��成为了一个新的概念。
[00:16.000 --> 00:44.000]  现在的元����技��,����没有我们想象中那么先进。按照目前世界第一元����公司,Roblox公司对于元����的定��来看,它起��要具��8个要素,分别是身份、社交、成进、����、多元、��地、经��、文明。身份就是一个����身份,��现实中的角色无关,这个比��好理解。社交也就是社交系��。成进就是感知����的升��,要做到和现实世界的体��完全相同。����就��������,不会有卡��,多元就多元化,

With the OpenAI whisper CLI:

whisper --language zh ~/Movies/samplecn16k.wav
[00:00.000 --> 00:01.760] 元宇宙其实就虚拟世界
[00:01.760 --> 00:04.400] 而且要用虚拟世界这个词来定义元宇宙的话
[00:04.400 --> 00:06.400] 要比元宇宙本身更加准确
[00:06.400 --> 00:07.680] 但这里就出现问题了
[00:07.680 --> 00:09.360] 那它为什么不叫虚拟世界呢?
[00:09.360 --> 00:10.720] 最简单的原因就是
[00:10.720 --> 00:12.880] 虚拟世界这个说法大家已经听腻了
[00:12.880 --> 00:14.320] 而元宇宙显得更为吸引
[00:14.320 --> 00:16.200] 又包装成为了一个新的概念
[00:16.200 --> 00:17.440] 现在的元宇宙技术
[00:17.440 --> 00:19.160] 原有没有我们想象中那么先进
[00:19.160 --> 00:21.320] 按照目前世界第一元宇宙公司
[00:21.320 --> 00:23.480] 罗布洛克斯公司对于元宇宙的定义来看
[00:23.480 --> 00:25.080] 它起码要具备8个要素
[00:25.080 --> 00:30.680] 分别是身份、社交、成敬、延迟、多元、随地、经济、文明
[00:30.680 --> 00:32.280] 身份就是一个虚拟身份
[00:32.280 --> 00:33.640] 与现实中的角色无关
[00:33.640 --> 00:34.640] 这个比较好理解
[00:34.640 --> 00:36.200] 社交也就是社交系统
[00:36.200 --> 00:38.320] 成敬就是感知设备的升级
[00:38.320 --> 00:40.800] 要做到和现实世界的体验完全相同
[00:40.800 --> 00:42.080] 延迟就网络延迟
[00:42.080 --> 00:43.080] 不会有卡顿
[00:43.080 --> 00:44.200] 多元就多元化
[00:44.200 --> 00:45.600] 比如可以在里面玩游戏
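
The missing characters line up with how the tokenizer splits text: a single Chinese character is three UTF-8 bytes, and those bytes can be spread across two BPE tokens, so printing each token as soon as it is decoded produces U+FFFD replacement characters (the �� pairs above). One way to handle this on the printing side is to buffer bytes until a complete sequence is available. A minimal sketch, assuming the caller feeds raw token text in order (hypothetical helper, not the project's implementation):

#include <cstdint>
#include <cstdio>
#include <string>

// Accumulates token bytes and prints only the longest prefix made of complete
// UTF-8 sequences; an incomplete trailing sequence waits for the next token.
struct utf8_printer {
    std::string pending;

    void feed(const char * token_text) {
        pending += token_text;

        size_t ok = 0; // length of the complete prefix
        while (ok < pending.size()) {
            const uint8_t b = pending[ok];
            size_t len = 1;                        // ASCII or stray byte
            if      ((b & 0xE0) == 0xC0) len = 2;  // 2-byte sequence
            else if ((b & 0xF0) == 0xE0) len = 3;  // 3-byte sequence (most CJK)
            else if ((b & 0xF8) == 0xF0) len = 4;  // 4-byte sequence
            if (ok + len > pending.size()) break;  // sequence not finished yet
            ok += len;
        }

        fwrite(pending.data(), 1, ok, stdout);
        pending.erase(0, ok);
    }
};

The real fix probably belongs where token text is merged into segment text inside whisper.cpp, but the byte-level idea is the same.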

Request to support aarch64

Make errors out on an aarch64 server.

make base.en
#gcc -pthread -O3 -c ggml.c
gcc -pthread -O3 -mcpu=cortex-a72 -mfloat-abi=hard -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -c ggml.c
gcc: error: unrecognized command-line option ‘-mfloat-abi=hard’
gcc: error: unrecognized command-line option ‘-mfpu=neon-fp-armv8’
gcc: error: unrecognized command-line option ‘-mfp16-format=ieee’
gcc: error: unrecognized command-line option ‘-mno-unaligned-access’
make: *** [Makefile:7: ggml.o] Error 1

Perhaps this would be enough for the C flags: -Ofast -g -mfpu=neon?

What's the build process for Windows?

I tried running "make" and got this error:

process_begin: CreateProcess(NULL, uname -s, ...) failed.
process_begin: CreateProcess(NULL, uname -p, ...) failed.
process_begin: CreateProcess(NULL, uname -m, ...) failed.
cc  -O3 -std=c11   -Wall -Wextra -Wno-unused-parameter -Wno-unused-function   -c ggml.c
process_begin: CreateProcess(NULL, cc -O3 -std=c11 -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -c ggml.c, ...) failed.
make (e=2): The system cannot find the file specified.
make: *** [ggml.o] Error 2

Could someone guide me through building this program on Windows? Are there pre-built binaries available? I have Visual Studio 2022 and MinGW installed.
