whisper's Introduction

This project is a Windows port of the whisper.cpp implementation, which in turn is a C++ port of OpenAI's Whisper automatic speech recognition (ASR) model.

Quick Start Guide

Download WhisperDesktop.zip from the “Releases” section of this repository, unpack the ZIP, and run WhisperDesktop.exe.

On the first screen it will ask you to download a model.
I recommend ggml-medium.bin (1.42GB in size), because I’ve mostly tested the software with that model.
Load Model Screen

The next screen allows you to transcribe an audio file.
Transcribe Screen

There’s another screen which allows you to capture and transcribe or translate live audio from a microphone.
Capture Screen

Features

  • Vendor-agnostic GPGPU based on DirectCompute; another name for that technology is “compute shaders in Direct3D 11”

  • Plain C++ implementation, no runtime dependencies except essential OS components

  • Much faster than OpenAI’s implementation.
    On my desktop computer with a GeForce 1080Ti GPU and the medium model, 3:24 minutes of speech took 45 seconds to transcribe with PyTorch and CUDA, but only 19 seconds with my implementation and DirectCompute.
    Fun fact: that’s 9.63 gigabytes of runtime dependencies, versus 431 kilobytes for Whisper.dll

  • Mixed F16 / F32 precision: Windows requires support of R16_FLOAT buffers since D3D version 10.0

  • Built-in performance profiler which measures execution time of individual compute shaders

  • Low memory usage

  • Media Foundation for audio handling; supports most audio and video formats (with the notable exception of Ogg Vorbis), and most audio capture devices which work on Windows (except some professional ones which only implement the ASIO API). A rough decoding sketch follows this list.

  • Voice activity detection for audio capture.
    The implementation is based on the 2009 article “A simple but efficient real-time voice activity detection algorithm” by Mohammad Moattar and Mahdi Homayoonpoor; a rough sketch of that decision rule also follows this list.

  • Easy to use COM-style API. An idiomatic C# wrapper is available on NuGet.
    Version 1.10 introduced scripting support for PowerShell 5.1, the older “Windows PowerShell” version which comes pre-installed on Windows.

  • Pre-built binaries available
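
For illustration, here is roughly what decoding an audio file with Media Foundation looks like. This is a minimal sketch using the public MF APIs, not the library’s actual code; it requests 32-bit float PCM and leaves out the resampling to the 16 kHz mono format the model ultimately needs.

    #include <windows.h>
    #include <mfapi.h>
    #include <mfidl.h>
    #include <mfreadwrite.h>
    #include <wrl/client.h>
    #pragma comment( lib, "mfplat.lib" )
    #pragma comment( lib, "mfreadwrite.lib" )
    #pragma comment( lib, "mfuuid.lib" )
    using Microsoft::WRL::ComPtr;

    // Decode any MF-supported audio or video file into interleaved FP32 PCM samples.
    // Error handling is trimmed down to the bare minimum for brevity.
    HRESULT decodeAudio( const wchar_t* path )
    {
        CoInitializeEx( nullptr, COINIT_MULTITHREADED );
        HRESULT hr = MFStartup( MF_VERSION );
        if( FAILED( hr ) ) return hr;

        ComPtr<IMFSourceReader> reader;
        hr = MFCreateSourceReaderFromURL( path, nullptr, &reader );
        if( FAILED( hr ) ) return hr;

        // Ask the source reader to decode the first audio stream into FP32 PCM
        ComPtr<IMFMediaType> mt;
        MFCreateMediaType( &mt );
        mt->SetGUID( MF_MT_MAJOR_TYPE, MFMediaType_Audio );
        mt->SetGUID( MF_MT_SUBTYPE, MFAudioFormat_Float );
        hr = reader->SetCurrentMediaType( (DWORD)MF_SOURCE_READER_FIRST_AUDIO_STREAM, nullptr, mt.Get() );
        if( FAILED( hr ) ) return hr;

        // Pull decoded samples until the end of the stream
        while( true )
        {
            DWORD flags = 0;
            ComPtr<IMFSample> sample;
            hr = reader->ReadSample( (DWORD)MF_SOURCE_READER_FIRST_AUDIO_STREAM, 0, nullptr, &flags, nullptr, &sample );
            if( FAILED( hr ) || ( flags & MF_SOURCE_READERF_ENDOFSTREAM ) )
                break;
            if( !sample )
                continue;

            ComPtr<IMFMediaBuffer> buffer;
            sample->ConvertToContiguousBuffer( &buffer );
            BYTE* data = nullptr;
            DWORD cb = 0;
            buffer->Lock( &data, nullptr, &cb );
            // ( data, cb ) now holds interleaved FP32 samples; downmix / resample as needed
            buffer->Unlock();
        }
        MFShutdown();
        return hr;
    }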

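And here is a rough sketch of the per-frame decision rule described in that VAD paper, assuming the spectral features have already been extracted. The thresholds are the ones suggested by the authors; this is an illustration only, not the code used in this project.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Features of a single 10 ms audio frame; computing them (FFT, dominant
    // frequency, spectral flatness) is omitted here for brevity.
    struct FrameFeatures
    {
        float energy;            // short-term energy
        float dominantFreq;      // frequency of the largest FFT bin, Hz
        float spectralFlatness;  // spectral flatness measure, dB
    };

    class SimpleVad
    {
        // Primary thresholds suggested in the paper
        static constexpr float energyPrimThresh = 40.0f;
        static constexpr float freqPrimThresh = 185.0f;
        static constexpr float sfmPrimThresh = 5.0f;

        float minEnergy = 0, minFreq = 0, minSfm = 0;
        uint32_t framesSeen = 0, silenceCount = 0;

    public:
        bool isSpeech( const FrameFeatures& f )
        {
            // The paper assumes the first 30 frames are silence, and uses them
            // to estimate the minimum (noise floor) values of the three features
            if( framesSeen < 30 )
            {
                minEnergy = ( framesSeen == 0 ) ? f.energy : std::min( minEnergy, f.energy );
                minFreq = ( framesSeen == 0 ) ? f.dominantFreq : std::min( minFreq, f.dominantFreq );
                minSfm = ( framesSeen == 0 ) ? f.spectralFlatness : std::min( minSfm, f.spectralFlatness );
            }
            framesSeen++;

            // The energy threshold adapts to the estimated noise floor
            const float threshEnergy = energyPrimThresh * std::log( std::max( minEnergy, 1e-10f ) );

            // Vote: the frame counts as speech when at least 2 of the 3 criteria fire
            int counter = 0;
            if( f.energy - minEnergy >= threshEnergy ) counter++;
            if( f.dominantFreq - minFreq >= freqPrimThresh ) counter++;
            if( f.spectralFlatness - minSfm >= sfmPrimThresh ) counter++;
            if( counter > 1 )
                return true;

            // Silence frame: keep adapting the noise floor energy estimate
            silenceCount++;
            minEnergy = ( silenceCount * minEnergy + f.energy ) / ( silenceCount + 1 );
            return false;
        }
    };

The paper additionally smooths these per-frame decisions, ignoring very short speech and silence runs; that part is omitted above.
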
The only supported platform is 64-bit Windows.
Should work on Windows 8.1 or newer, but I have only tested on Windows 10.
The library requires a Direct3D 11.0 capable GPU, which in 2023 simply means “any hardware GPU”. The most recent GPU without D3D 11.0 support was Intel Sandy Bridge from 2011.

On the CPU side, the library requires AVX1 and F16C support.

Developer Guide

Build Instructions

  1. Clone this repository

  2. Open WhisperCpp.sln in Visual Studio 2022. I’m using the freeware community edition, version 17.4.4.

  3. Switch to Release configuration

  4. Build and run the CompressShaders C# project, in the Tools subfolder of the solution. To run that project, right-click it in Visual Studio, choose “Set as Startup Project”, then in the main menu of VS pick “Debug / Start Without Debugging”. When it completes successfully, you should see a console window with a line like this:
    Compressed 46 compute shaders, 123.5 kb -> 18.0 kb

  5. Build Whisper project to get the native DLL, or WhisperNet for the C# wrapper and nuget package, or the examples.

Other Notes

If you are going to consume the library in software built with Visual C++ 2022 or newer, you probably already redistribute the Visual C++ runtime DLLs, in the form of the .msm merge module or the vc_redist.x64.exe binary.
If you do that, right-click the Whisper project, Properties, C/C++, Code Generation, switch the “Runtime Library” setting from Multi-threaded (/MT) to Multi-threaded DLL (/MD), and rebuild: the binary will become smaller.

The library includes RenderDoc GPU debugger integration.
When you launch your program from RenderDoc, hold the F12 key to capture the compute calls.
If you are going to debug HLSL shaders, use the debug build of the DLL: it includes debug builds of the shaders, and you’ll get a better UX in the debugger.

The repository includes a lot of code which was only used for development: a couple of alternative model implementations, compatible FP64 versions of some compute shaders, debug tracing and a tool to compare the traces, etc.
That stuff is disabled by preprocessor macros or constexpr flags; I hope it’s fine to keep it here.

Performance Notes

I have a limited selection of GPUs in this house.
Specifically, I have optimized for nVidia 1080Ti, Radeon Vega 8 inside Ryzen 7 5700G, and Radeon Vega 7 inside Ryzen 5 5600U.
Here’s the summary.

Relative speed here means the duration of the audio divided by the transcription time; for example, the 3:24 clip mentioned above, transcribed in 19 seconds, corresponds to a relative speed of about 10.7.
The nVidia card delivers relative speed 5.8 for the large model, 10.6 for the medium model.
The AMD Ryzen 5 5600U APU delivers relative speed about 2.2 for the medium model. Not great, but still much faster than realtime.

I have also tested on nVidia 1650: slower than 1080Ti but pretty good, much faster than realtime.
I have also tested on the Intel HD Graphics 4000 inside a Core i7-3612QM: the relative speed was 0.14 for the medium model and 0.44 for the small model. That’s much slower than realtime, but I was happy to find my software works even on an integrated mobile GPU launched in 2012.

I’m not sure the performance is ideal on discrete AMD GPUs or integrated Intel GPUs; I have not specifically optimized for them.
Ideally, they might need slightly different builds of a couple of the most expensive compute shaders, mulMatTiled.hlsl and mulMatByRowTiled.hlsl,
and maybe other adjustments, like the useReshapedMatMul() value in the Whisper/D3D/device.h header file.

I don’t know how to measure that, but I have a feeling the bottleneck is memory bandwidth, not compute.
Someone on Hacker News has tested on a 3060Ti, the version with GDDR6 memory. Compared to the 1080Ti, that GPU has 1.3x the FP32 FLOPS, but 0.92x the VRAM bandwidth. The app was about 10% slower on the 3060Ti.

Further Optimisations

I have only spent a few days optimizing the performance of these shaders.
It might be possible to do much better; here are a few ideas.

  • Newer GPUs like Radeon Vega or nVidia 1650 have higher FP16 performance compared to FP32, yet my compute shaders are only using FP32 data type.
    Half The Precision, Twice The Fun

  • In the current version, FP16 tensors use shader resource views to upcast loaded values, and unordered access views to downcast stored ones.
    It might be a good idea to switch to byte address buffers, load/store complete 4-byte values, and upcast / downcast in HLSL with the f16tof32 / f32tof16 intrinsics; a conversion sketch follows this list.

  • In the current version all shaders are compiled offline, and Whisper.dll includes the DXBC byte code.
    The HLSL compiler D3DCompiler_47.dll is an OS component, and it’s pretty fast. For the expensive compute shaders, it’s probably a good idea to ship HLSL instead of DXBC, and compile on startup with environment-specific values for the macros; a sketch of that also follows this list.

  • It might be a good idea to upgrade the whole thing from D3D11 to D3D12.
    The newer API is harder to use, but includes potentially useful features not exposed to D3D11: wave intrinsics, and explicit FP16.
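
For reference, the sketch below reproduces on the CPU the bit-level conversions those HLSL intrinsics perform; in a shader, one 32-bit word loaded from a byte address buffer would hold two such 16-bit values in its low and high halves. This is only an illustration, not code from the repository; the float-to-half path truncates the mantissa instead of rounding to nearest, which is good enough for a quick sanity check.

    #include <cmath>
    #include <cstdint>
    #include <cstring>

    // Expand an IEEE 754 half-precision value (stored in the low 16 bits) to float
    float halfToFloat( uint16_t h )
    {
        const int sign = ( h >> 15 ) & 1;
        const int exponent = ( h >> 10 ) & 0x1F;
        const int mantissa = h & 0x3FF;
        float value;
        if( exponent == 0 )          // zero or subnormal
            value = std::ldexp( (float)mantissa, -24 );
        else if( exponent == 31 )    // infinity or NaN
            value = ( mantissa == 0 ) ? INFINITY : NAN;
        else                         // normal number, implicit leading 1 bit
            value = std::ldexp( (float)( mantissa | 0x400 ), exponent - 25 );
        return sign ? -value : value;
    }

    // Compress a float to half precision; truncates the mantissa and flushes
    // half-precision subnormals to zero, which is fine for a quick sanity check
    uint16_t floatToHalf( float f )
    {
        uint32_t bits;
        std::memcpy( &bits, &f, sizeof( bits ) );
        const uint32_t sign = ( bits >> 16 ) & 0x8000;
        const int exponent = (int)( ( bits >> 23 ) & 0xFF ) - 127 + 15;
        const uint32_t mantissa = bits & 0x7FFFFF;
        if( exponent >= 31 )
            return (uint16_t)( sign | 0x7C00 );   // overflow, Inf and NaN all map to Inf
        if( exponent <= 0 )
            return (uint16_t)sign;                // underflow maps to signed zero
        return (uint16_t)( sign | ( (uint32_t)exponent << 10 ) | ( mantissa >> 13 ) );
    }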

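And here is a sketch of the runtime-compilation idea, again not the project’s actual code: compiling a compute shader on startup with the OS-supplied D3DCompiler, passing in environment-specific values. The THREADS / USE_FP16 macro names are hypothetical, only there to show how the defines would be fed to the compiler.

    #include <windows.h>
    #include <d3d11.h>
    #include <d3dcompiler.h>
    #include <wrl/client.h>
    #include <cstdint>
    #include <string>
    #pragma comment( lib, "d3dcompiler.lib" )
    using Microsoft::WRL::ComPtr;

    // Compile a compute shader from HLSL source at startup, baking in values
    // measured on the current machine. Returns the DXBC blob on success.
    HRESULT compileComputeShader( const std::string& hlsl, const char* name,
        uint32_t threadsPerGroup, bool useFp16, ComPtr<ID3DBlob>& dxbc )
    {
        const std::string threads = std::to_string( threadsPerGroup );
        const D3D_SHADER_MACRO macros[] =
        {
            { "THREADS", threads.c_str() },     // hypothetical macros, for illustration only
            { "USE_FP16", useFp16 ? "1" : "0" },
            { nullptr, nullptr },
        };

        ComPtr<ID3DBlob> errors;
        // No #include handling in this sketch; pass an ID3DInclude to support it
        const HRESULT hr = D3DCompile( hlsl.data(), hlsl.size(), name,
            macros, nullptr, "main", "cs_5_0",
            D3DCOMPILE_OPTIMIZATION_LEVEL3, 0, &dxbc, &errors );
        if( FAILED( hr ) && errors )
            OutputDebugStringA( (const char*)errors->GetBufferPointer() );
        return hr;
    }

The resulting DXBC blob would then go to ID3D11Device::CreateComputeShader the same way the precompiled byte code does.
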
Missing Features

Automatic language detection is not implemented.

In the current version there’s high latency for realtime audio capture.
Specifically, depending on voice detection, the figure is about 5-10 seconds.
At least in my tests, the model wasn’t happy when I supplied pieces of audio that were too short.
I have increased the latency and called it a day, but ideally this needs a better fix for optimal UX.

Final Words

From my perspective, this is an unpaid hobby project, which I completed over the 2022-23 winter holidays.
The code probably has bugs.
The software is provided “as is”, without warranty of any kind.

Thanks to Georgi Gerganov for the whisper.cpp implementation, and for the models in GGML binary format.
I don’t program Python, and I don’t know anything about the ML ecosystem.
I wouldn’t even start this project without a good C++ reference implementation, to test my version against.

That whisper.cpp project has an example which uses the same GGML implementation to run another OpenAI model, GPT-2.
It shouldn’t be hard to support that ML model with the compute shaders and relevant infrastructure already implemented in this project.

If you find this useful, I’ll be very grateful if you consider a donation to the “Come Back Alive” foundation.

whisper's Issues

Feature Request: Auto Populate output path based on input

Hello

Great program, been testing it for a few days now and am loving it! One small feature request I'd like to make: would it be possible to auto-infer and populate the output path and filename based on the input?

Big thanks for the great project!

Some problem

My laptop has an i7 CPU and an RTX 3050, and when I use the software the discrete GPU is not working, while Intel's integrated graphics run at full power.

Provide raw audio samples from .NET

Not really an issue; merely curious.
One of the MF helpers simply accepts a file path, however providing audio samples directly would also be nice, e.g. when recording audio from .NET. Is it planned?

Support more file types in the picker

Currently .m4a files do not show up in the file picker. I guess they do not fall under "multimedia files". When I put in the path to the file manually, it works fine.

Thank you for your work.

"Hybrid" mode is not working

Other than "GPU" mode, I could not run the other modes such as "Hybrid"; they throw an error about missing DLLs. How can we fix this, and what are the main purposes of these modes? I could not find any explanation for them. By the way, very good project, and the results are great!

Output timing info to text files

The transcription in subtitle format, as displayed in the terminal, shows timing info, e.g. [00:00:00 --> 00:00:02] words. Text files contain only line breaks. It would be great if they could show the same information displayed in the terminal.

Great project btw

transcription does not terminate at eof

first, thank you for your great work.
In certain cases the transcription process does not terminate at the end of the file, and does not write to the txt/vtt/srt file. My test file was a 48 kHz stereo MP3. If I convert it to 16-bit WAV, as expected by the original whisper.cpp, it works fine.

encoding errors on example program

Thank you for your contribution. The DirectCompute version of Whisper is much faster than the pure CPU version of whisper.cpp. I had some issues when using it. The first one is an encoding problem: in the debug output of the desktop version, the output content sometimes lacked a few characters. I think it was a conversion from UTF-8 to CP_ACP (Windows-936, GB2312-80). Similar encoding errors also happened in the CLI, where the output dictation text was almost unreadable, mostly '?' characters.
The second issue is that almost every audio file that is transcribed reports the error “runFullImpl: failed to generate timestamp token - skipping one second”.
The third problem is similar to #18: it always stops working after recognizing for a period of time, and repeatedly outputs the last recognized sentence.
If you plan to track the last two problems, I can open another issue.

Build issues

Tried building the entire solution in VS 2019 and got errors that led me to use VS 2022 (because it needs .NET 6, presumably).

Tried using VS 2022 and got:

Build started...
1>------ Build started: Project: PerfSummary, Configuration: Debug Any CPU ------
2>------ Build started: Project: compareTraces, Configuration: Debug x64 ------
3>------ Build started: Project: OldMain, Configuration: Debug x64 ------
4>------ Build started: Project: ComputeShaders, Configuration: Debug x64 ------
5>------ Build started: Project: ComLightLib, Configuration: Debug x64 ------
2>stdafx.cpp
3>ggmlMsvc.c
3>C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.33.31629\include\xmmintrin.h(79,10): fatal error C1083: Cannot open include file: 'malloc.h': No such file or directory
2>C:\Users\smbik\Desktop\Git\Whisper\Tools\compareTraces\stdafx.h(3,10): fatal error C1083: Cannot open include file: 'assert.h': No such file or directory
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\addRows.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\addRepeat64.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\addRepeatEx.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\addRepeat.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\convolutionMain2.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\convolutionPrep1.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\copyTranspose.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\addRepeatGelu64.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\convolutionPrep2.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\addRepeatGelu.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\addRepeatScale.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\addInPlace.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\add.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\copyConvert.cso
4>C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\convolutionMain.hlsl(34,4-29): warning X3557: loop doesn't seem to do anything, forcing loop to unroll
4>C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\convolutionMain.hlsl(34,4-29): warning X3557: loop doesn't seem to do anything, forcing loop to unroll
4>
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\convolutionMain.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\convolutionMain2Fixed.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\flashAttentionCompat2.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\diagMaskInf.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\fmaRepeat1.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\flashAttentionCompat1.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\fmaRepeat164.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\fmaRepeat2.cso
3>Done building project "OldMain.vcxproj" -- FAILED.
2>Done building project "compareTraces.vcxproj" -- FAILED.
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\flashAttention.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\mulMatByScalar.cso
4>C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\mulMatByRow64.hlsl(37,3-70): warning X3557: loop only executes for 1 iteration(s), forcing loop to unroll
4>
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\mulMatByRow64.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\flashAttentionCompat3.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\mulMatDotMain.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\matReshapePanels.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\mulMatByRowTiledEx.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\mulMatByRow.cso
5>freeThreadedMarshaller.cpp
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\mulMatDotReshape.cso
5>C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\um\winnt.h(34,10): fatal error C1083: Cannot open include file: 'ctype.h': No such file or directory
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\normCompat.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\norm.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\scaleInPlace.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\mulMatMadMain.cso
5>Done building project "ComLightLib.vcxproj" -- FAILED.
6>------ Build started: Project: Whisper, Configuration: Debug x64 ------
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\normFixed.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\softMax64.cso
4>C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\normFixed.hlsl(31,3-70): warning X3557: loop only executes for 1 iteration(s), forcing loop to unroll
4>C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\normFixed.hlsl(31,3-70): warning X3557: loop only executes for 1 iteration(s), forcing loop to unroll
4>
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\normFixed64.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\softMax.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\softMaxLong.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\softMaxCompat.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\zeroMemory.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\softMaxFixed.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\mulMatByRowTiled.cso
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\mulMatTiledEx.cso
6>stdafx.cpp
6>C:\Users\smbik\Desktop\Git\Whisper\Whisper\stdafx.h(4,10): fatal error C1083: Cannot open include file: 'assert.h': No such file or directory
6>Done building project "Whisper.vcxproj" -- FAILED.
7>------ Build started: Project: WhisperDesktop, Configuration: Debug x64 ------
8>------ Build started: Project: main, Configuration: Debug x64 ------
9>------ Build started: Project: WhisperNet, Configuration: Debug Any CPU ------
4>compilation object save succeeded; see C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\mulMatTiled.cso
1>PerfSummary -> C:\Users\smbik\Desktop\Git\Whisper\Tools\PerfSummary\bin\Debug\PerfSummary.dll
4>ComputeShaders.cpp
9>WhisperNet -> C:\Users\smbik\Desktop\Git\Whisper\WhisperNet\bin\Debug\WhisperNet.dll
10>------ Build started: Project: MicrophoneCS, Configuration: Debug x64 ------
11>------ Build started: Project: TranscribeCS, Configuration: Debug x64 ------
4>ComputeShaders.vcxproj -> C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\x64\Debug\ComputeShaders.lib
8>useDiscreteGpu.c
7>stdafx.cpp
4>Done building project "ComputeShaders.vcxproj".
12>------ Build started: Project: CompressShaders, Configuration: Debug Any CPU ------
7>C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\um\winnt.h(34,10): fatal error C1083: Cannot open include file: 'ctype.h': No such file or directory
7>Done building project "WhisperDesktop.vcxproj" -- FAILED.
8>main.cpp
8>C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.33.31629\include\cstdlib(12,10): fatal error C1083: Cannot open include file: 'math.h': No such file or directory
8>miscUtils.cpp
8>C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.33.31629\include\yvals.h(12,10): fatal error C1083: Cannot open include file: 'crtdbg.h': No such file or directory
8>params.cpp
10>C:\Program Files\Microsoft Visual Studio\2022\Professional\MSBuild\Current\Bin\amd64\Microsoft.Common.CurrentVersion.targets(5097,5): error MSB3030: Could not copy the file "C:\Users\smbik\Desktop\Git\Whisper\x64\Debug\Whisper.dll" because it was not found.
8>C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.33.31629\include\cstdlib(12,10): fatal error C1083: Cannot open include file: 'math.h': No such file or directory
8>textWriter.cpp
8>C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\um\winnt.h(34,10): fatal error C1083: Cannot open include file: 'ctype.h': No such file or directory
8>Generating Code...
8>Done building project "main.vcxproj" -- FAILED.
10>Done building project "MicrophoneCS.csproj" -- FAILED.
11>C:\Program Files\Microsoft Visual Studio\2022\Professional\MSBuild\Current\Bin\amd64\Microsoft.Common.CurrentVersion.targets(5097,5): error MSB3030: Could not copy the file "C:\Users\smbik\Desktop\Git\Whisper\x64\Debug\Whisper.dll" because it was not found.
11>Done building project "TranscribeCS.csproj" -- FAILED.
12>CompressShaders -> C:\Users\smbik\Desktop\Git\Whisper\Tools\CompressShaders\bin\Debug\CompressShaders.dll
========== Build: 4 succeeded, 8 failed, 0 up-to-date, 0 skipped ==========

The error list:

Severity Code Description Project File Line Suppression State
Error C1083 Cannot open include file: 'malloc.h': No such file or directory OldMain C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.33.31629\include\xmmintrin.h 79
Error C1083 Cannot open include file: 'assert.h': No such file or directory compareTraces C:\Users\smbik\Desktop\Git\Whisper\Tools\compareTraces\stdafx.h 3
Warning X3557 loop doesn't seem to do anything, forcing loop to unroll ComputeShaders C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\convolutionMain.hlsl 34
Warning X3557 loop doesn't seem to do anything, forcing loop to unroll ComputeShaders C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\convolutionMain.hlsl 34
Warning X3557 loop only executes for 1 iteration(s), forcing loop to unroll ComputeShaders C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\mulMatByRow64.hlsl 37
Error C1083 Cannot open include file: 'ctype.h': No such file or directory ComLightLib C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\um\winnt.h 34
Warning X3557 loop only executes for 1 iteration(s), forcing loop to unroll ComputeShaders C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\normFixed.hlsl 31
Warning X3557 loop only executes for 1 iteration(s), forcing loop to unroll ComputeShaders C:\Users\smbik\Desktop\Git\Whisper\ComputeShaders\normFixed.hlsl 31
Error C1083 Cannot open include file: 'assert.h': No such file or directory Whisper C:\Users\smbik\Desktop\Git\Whisper\Whisper\stdafx.h 4
Error C1083 Cannot open include file: 'ctype.h': No such file or directory WhisperDesktop C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\um\winnt.h 34
Error C1083 Cannot open include file: 'math.h': No such file or directory main C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.33.31629\include\cstdlib 12
Error C1083 Cannot open include file: 'crtdbg.h': No such file or directory main C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.33.31629\include\yvals.h 12
Error C1083 Cannot open include file: 'math.h': No such file or directory main C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.33.31629\include\cstdlib 12
Error C1083 Cannot open include file: 'ctype.h': No such file or directory main C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\um\winnt.h 34
Error MSB3030 Could not copy the file "C:\Users\smbik\Desktop\Git\Whisper\x64\Debug\Whisper.dll" because it was not found. MicrophoneCS C:\Program Files\Microsoft Visual Studio\2022\Professional\MSBuild\Current\Bin\amd64\Microsoft.Common.CurrentVersion.targets 5097
Error MSB3030 Could not copy the file "C:\Users\smbik\Desktop\Git\Whisper\x64\Debug\Whisper.dll" because it was not found. TranscribeCS C:\Program Files\Microsoft Visual Studio\2022\Professional\MSBuild\Current\Bin\amd64\Microsoft.Common.CurrentVersion.targets 5097

I'll check the errors one by one but it seems odd it can't build out of the box...thanks!

Cannot locate literal strings that are output during detection/transcription

I notice that sometimes the following strings (amongst others) will appear while it's transcribing:

[BLANK_AUDIO]
[MUSIC]
[VIDEO STARTING]
etc.

I did a search for these strings and they do not appear to be in the code base. I extended the search to my C: drive and could not find them anywhere. My search facility may have skipped some files due to format (dlls and exes)...

Can you tell me where they originate from?

Thanks!

Unsafe build tags

I just updated the dependency to 1.2.0, and the first build produced this error: The package product 'whisper' cannot be used as a dependency of this target because it uses unsafe build flags.

I haven't received this on older versions; have you run into it with the latest build?

running multiple concurrent threads (each with their own model+context) throws exceptions

I tried running 2 threads - each allocating its own model and context - but calling context.runFull() from both threads at the same time causes exceptions.

Is this tested/supported? My GPU has more than enough VRAM and shaders to run several instances concurrently. This would be highly useful since it would increase throughput anywhere from 2x to 20x depending on GPU.

voice recognition problem

I'd like to report a recognition problem. If a video has human voices only at the beginning and at the end, with no voice in the middle, the voice at the end will not be recognized correctly; it is simply treated as if there were no voice.

Linux port

Thanks for your great work!
Any chance of seeing a Linux port on the roadmap?

Multilingual recognition in one file

Is it possible to do multilingual recognition in one recording, like in the Python version? I mean, when people in one audio file sometimes talk in different languages, the Python version with the large-v2 model decodes all the languages at once, but this C++ implementation just writes [Spanish] or [Different language] etc.

Possible methods to eliminate GPU usage

Hi there,
I'd first like to thank you for writing this code. It's been a godsend to use it as a baseplate, and implementing loopback support via WASAPI was certainly not as bad as it would've been with other codebases.

However, I'm trying to get the model to use less GPU. I understand it's using compute shaders, but this causes noticeable lag when using the GPU for other things such as games. Is there any method for reducing the load on the GPU, short of fully switching to the CPU?

Thanks

Support for multiple GPUs

My machine has a number of GPUs in it.

NVidia 1050
ATI Radeon RX 580
and the Intel UHD 630 on the CPU

I'm not sure what method your software uses to select the GPU, but it doesn't seem like I can influence it at all.

Audio capture cannot find devices

For some reason the Whisper Desktop application cannot find any audio capture device. I have no problem with other apps like Discord, Firefox, OBS, Android emulators, Audacity, etc. I have the correct authorizations in the privacy settings of Windows to allow apps to use my microphones.
I have both a USB microphone (Samson Q2U) and a virtual cable (VB Audio Virtual Cable), but neither is displayed.

I ran the code in Visual Studio and noticed the error "0xe000020b" returned by the function "MFEnumDeviceSources( attrs, &ppDevices, &count )". I don't know what this error means. I also checked that MFStartup executes correctly, without error.

I really want to do realtime audio translation. I've had great success translating Japanese videos and transcribing English videos on an RX 6800XT. It's honestly impressive.

text output looping/repeating (until end)

Also mentioned here:
#23

I ran a few tests on longer video clips (e.g. 2 hours), and mostly it tends to repeat a sentence from a certain point until the end. E.g. after one hour, you see repeated output forever. The timestamps seem to indicate that new text was detected, but the text content is the same as before.

https://1drv.ms/u/s!AkS-A9Jqq09FgzEX78lvh7SiMAYu?e=C6f8GW

In this example, after about 2 minutes, the sentence repeats: "Jagt mich mal mit frahmen nudelholz".
The exact command that I use:
C:\dev\whisper\Whisper\x64\Release\main.exe -f C:\temp\test.wav -l de -m C:\temp\whisper\ggml-large.bin

I did different tests and cut portions of some affected files, and it turns out the problem is not caused by the audio content itself: the affected area transcribes just fine if I cut, for example, 1 minute before it, but if I leave 2 minutes before it, the repetition happens.

In ContextImpl.cpp, I tried to catch "repeated text" by copying the latest "text" to the heap as soon as it is complete (around line 740), and before that, comparing whether the text is the same as last time. If yes, seek a little:
// compare the current text against the text remembered from the previous iteration
if (lasttext != nullptr && 0 == strcmp(lasttext->c_str(), text.c_str())) {
 logDebug(u8"last text repeated");
 seek += 100; // skip ahead a little to break out of the loop
}

// remember the current text for the next iteration
delete lasttext;
lasttext = new std::string(text);

This seems to work around the issue (it still needs a lot of testing), but I am really not sure if this is the correct way to do it.
Also, a question: is it correct to do such workarounds there (there is another workaround a few lines above), or should the cause be found and fixed somewhere else? (Where?)

Performance notes on AMD (RDNA2)

Not really an issue, but I'm sharing my thoughts on RDNA2 performance. In short: it's great. On my 6900 XT it looks like around 7.7x realtime for the large model, and 10.5x for the medium one.

Adding speaker diarization as a feature

Would it be possible to add speaker diarization in the future, as an option to label who speaks which part of the transcription? Very good project and results, by the way. Thanks

Works with WINE on Linux/Mac thanks to using DirectX 11 instead of 12

For anyone interested: I have tested this library by running an app under WINE 8.3 (www.winehq.org) on Linux, and it works, thanks to NOT requiring DirectX 12 (WINE only goes up to 11 at this time). Unless there are performance improvements, it might be good to keep targeting the DirectX 11 API for the time being.

The only issue was the use of CreateDecompressor in shaders.cpp, because WINE's cabinet.dll does not support those functions. I had to disable this compression/decompression pair to make things work. I'm not sure the size matters so much, so perhaps it's better to just leave out the compression?

Where are the releases?

Sorry to ask, but I could not find the releases link to download the Windows executable. Is there a link to it in the readme?

Transcription results produce very short lines of text

I'm using the latest build (1.7) and transcribing Mandarin audio. The output I've been getting from over 10 files fails to include any punctuation and each line is usually ten characters or less. Any idea why this is happening?

Non-deterministic output regarding hyphenation

I transcribed a file and the result included hyphenation. After review of the transcription, it's apparent that hyphens indicate change in speaker. Somehow the model is able to distinguish different voices and separate the back-and-forth of conversation into statements that start with -.

I transcribed a different file, and the result did not have hyphenation.

It's unclear why I got hyphenation in some files, but not others.

Request: Include binary for whispercpp beta v1.1.0

Great work! I did not expect this program to be so fast. One thing that could make it better: include an additional WhisperDesktop.exe with the updates from the official whisper.cpp beta v1.1.0. This beta improves the transcription of files that previously got stuck on the same sentence. I hope this is a reasonable request.

Feature request: Transcribe with different models, one after another.

In my tests, ggml-medium.bin and ggml-large.bin each have their own wins and losses.
So I'd like to get two output txt files with just one click.

We could load the two models at the same time.
Output to filename-large.txt and filename-medium.txt, one by one.

Translations

Thanks for this very useful application. Do you think it's possible to add the ability to translate into languages other than English? In my case, Italian. Ciao

Language Auto Detection

I can't make language auto-detection work.

When using the CLI, it seems to default to English rather than "auto"
(tested with a Chinese audio sample).

The UI has a dropdown without "auto" as an option.

MFCreateSourceReaderFromURL failed

\Whisper\Examples\TranscribeCS\bin\x64\Release>TranscribeCS -l zh -ovtt L:\out.vtt -f "L:\1.mp3" -m L:\WhisperDesktop\models\ggml-tiny.bin
Using GPU "NVIDIA GeForce GTX 1070", feature level 12.1, effective flags Wave32 | NoReshapedMatMul
Loaded MEL filters, 62.8 kb RAM
Loaded vocabulary, 51865 strings, 771.3 kb RAM
Loaded 167 GPU tensors, 73.5388 MB VRAM
Computed CPU base frequency: 3.6 GHz
Loaded model from "L:\WhisperDesktop\models\ggml-tiny.bin" to VRAM
StreamFile
False
MFCreateSourceReaderFromURL failed
MFCreateSourceReaderFromURL failed

Mojibake in Debug Window

Like the upstream issue ggerganov/whisper.cpp#399, when using -pc, some characters in the terminal output cannot be displayed normally.
So maybe there could be an option to disable the color print function?

App crash on model loading

I try to load a model and the app crashes without any message. I tried different models (tiny, small, medium, large) with no result. I use Win10 x64. GPU - GTX 970 4 GB, CPU - i7 3930K, RAM - 28 GB. I also tested the app in a VM, with the same result.

Add option for batch transcribing multiple files?

Hi.
Thanks for the great work,
GPU acceleration decodes text dramatically faster than the official ggerganov project, which doesn't support GPU yet.
Could you please implement an option for generating multiple subtitles, allowing the program to batch-import files?
Thank you very much in advance!

A little advice

I would like to suggest adding the ability to run batch file tasks.
Also, if real-time speech recognition is implemented with low latency, could we do desktop captioning? That way we could watch videos with real-time translation.

Request: Implement llama.cpp as a windows app with D3D acceleration

In the "Final words" section of your readme file you talk about the possibility of implementing GPT-2. Now that llama.cpp is mature enough, I think it would be great to have a D3D accelerated version, if you have the time. It seems to use the same ggml.c base, with other files to support the quantized LLaMA models.

Thanks for considering it!

Is it possible to batch transcribe files?

Firstly, thank you for sharing your work. I've been using whisper via command line for a couple of months now and your implementation is considerably faster.

Secondly, is it possible to transcribe files in batches? My typical use case is transcribing several video classes overnight in a batch so that they are ready for me to read in the morning.

-ml/--max-len don't seem to work.

It works in ggerganov/whisper.cpp with the exact same command line.

I get this error:

eFullParamsFlags.TokenTimestamps flag is not supported in streaming mode
Unable to process audio: Not implemented

Stuck in audio capture

I was able to add a model to Whisper. On the next page I selected audio capture. However, there seems to be no way to browse to a file instead. I hit back and it goes back to the model selection screen; Close exits the program.

[Edit: figured it out - it is "Transcribe file". That sounds like transcribe to a file, i.e. output subtitles. I suggest rewording it to "Load audio file".]
