Comments (4)
Thanks for the report @bjnortier :) I just tested:
1) whisper.cpp + --beam-size 1: Produces the first sentence from the result you shared and stops early
2) whisper.cpp + --beam-size 5: Almost the same result as you reported above
3) WhisperKit (always) --beam-size 1: Halts early
We will investigate whether an aggressive hallucination guardrail or some input audio processing is causing this discrepancy for this particular sample and report back.
In general, we do not anticipate any quality drop when moving from whisper.cpp to WhisperKit: the evaluation results on librispeech are almost identical, and whisper.cpp has higher error rates on earnings22. (Eval results are published here; whisper.cpp is not yet tabulated, but the eval files are in this repo.)
from whisperkit.
Yes, definitely a beam search thing, thanks for this great example. The model wants to predict no speech in the beginning because the audio is quiet, and stops early as @atiorh mentioned. You can get around this somewhat by forcing the first tokens, i.e. telling the model there is definitely speech here, with the decoding option --use-prefill-prompt, but there is still some missing speech:
(crowd shouting) And our walls and shattered shields when the Egypt man comes crashing down. But it is not this day. This day we fight. By all that you hold dear, this good earth, I bid you stand, hand over west. (shouting)
I managed to improve it a bit by scaling the loudness and applying a simple band-pass filter for human speech frequency ranges, using this file and this command:
swift run whisperkit-cli transcribe --model small --audio-path ~/Downloads/aragorn_loudest_filtered.wav --verbose --word-timestamps --use-prefill-prompt --temperature=0.2
Hold your breath! Hold your breath! Back to Gondor! Abroja! My brothers! I see in your eyes the same fear that would take the heart of me. But they may come when the courage of men fails, when we forsake our friends and break all bonds of fellowship. But it is not this day. An hour of wolves and shattered shields when the age of men comes crashing down. But it is not this day. This day we fight. By all that you hold dear, this good earth I bid you stand, head for the west. (drill whirring)
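For anyone who wants to try the same kind of preprocessing, here is a rough sketch of the idea in plain Python. It is not the exact processing used for the file above: the 300 to 3400 Hz band, the first-order filters, the filter-then-normalize order, and all function names are assumptions chosen for illustration.

```python
import math

def high_pass(samples, sample_rate, cutoff_hz):
    """First-order RC high-pass: attenuates content below cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_in = prev_out = 0.0
    for i, x in enumerate(samples):
        y = x if i == 0 else alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out

def low_pass(samples, sample_rate, cutoff_hz):
    """First-order RC low-pass: attenuates content above cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out = []
    prev_out = 0.0
    for x in samples:
        prev_out = prev_out + alpha * (x - prev_out)
        out.append(prev_out)
    return out

def band_pass(samples, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Band-pass built by chaining the high-pass and low-pass stages.

    The default band is an assumed rough speech range, not a value from this thread.
    """
    return low_pass(high_pass(samples, sample_rate, low_hz), sample_rate, high_hz)

def peak_normalize(samples, target_peak=0.9):
    """Scale the signal so its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

def preprocess(samples, sample_rate):
    """Filter first, then bring the remaining signal up to a consistent peak."""
    return peak_normalize(band_pass(samples, sample_rate))
```

The input is assumed to be mono float samples in the range -1.0 to 1.0 (for example 16 kHz audio already decoded from the WAV file). First-order filters roll off gently, so a steeper filter from an audio tool or DSP library would likely isolate speech better.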
Will look into what we can clean up on the audio side by default, but beam search will require some reworking of the model; CoreML has some limitations that prevented us from implementing it originally.
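To illustrate why beam size matters for a quiet opening, here is a toy sketch in Python. It is not WhisperKit or whisper.cpp code: the probability table is made up, the token names are hypothetical, and ranking finished hypotheses by average log-probability per token is an assumption of this sketch.

```python
import math

EOT = "<eot>"

# Hypothetical next-token probabilities, keyed by the tokens decoded so far.
# At the start the toy model slightly prefers ending immediately (quiet audio).
TOY_MODEL = {
    (): {EOT: 0.55, "My": 0.45},
    ("My",): {"brothers": 0.9, EOT: 0.1},
    ("My", "brothers"): {EOT: 1.0},
}

def greedy_decode(model):
    """Always take the single most likely next token; stop at EOT."""
    prefix = ()
    while True:
        probs = model[prefix]
        token = max(probs, key=probs.get)
        if token == EOT:
            return list(prefix)
        prefix += (token,)

def beam_decode(model, beam_size):
    """Keep the beam_size best expansions per step, then rank finished
    hypotheses by average log-probability per token (assumed scoring)."""
    beams = [((), 0.0)]  # (prefix, summed log-probability)
    finished = []        # (prefix, length-normalized score)
    while beams:
        expansions = []
        for prefix, score in beams:
            for token, p in model[prefix].items():
                expansions.append((prefix, token, score + math.log(p)))
        expansions.sort(key=lambda e: e[2], reverse=True)
        beams = []
        for prefix, token, score in expansions[:beam_size]:
            if token == EOT:
                # Count the EOT token in the length used for normalization.
                finished.append((prefix, score / (len(prefix) + 1)))
            else:
                beams.append((prefix + (token,), score))
    return list(max(finished, key=lambda f: f[1])[0])
```

With this table, `greedy_decode(TOY_MODEL)` and `beam_decode(TOY_MODEL, 1)` both return an empty transcript because the first step favors ending, while `beam_decode(TOY_MODEL, 5)` keeps the "speech" branch alive and returns `["My", "brothers"]`.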
If you notice more discrepancies, please keep flagging them.
Ok interesting, thanks for the info!