Comments (13)
@miguel-arrf thanks for the report, which models did you try? Depending on the source audio, the smaller models can have a tough time with transcription. To get Spanish text out, you'll definitely want to use transcribe with the language set to Spanish. If you have a sample audio file we can take a look as well.
from whisperkit.
@miguel-arrf Could you also report the results with Settings > Cache prefill disabled?
from whisperkit.
I see the same behavior with the distil-whisper_distil-large-v3_594MB model. If language is set to nil, it translates to English even when task is set to transcribe.
from whisperkit.
I have also tried with Cache prefill set to false, with the same result. If languageCode is set, everything works as expected, but I wanted to use automatic language detection instead of setting the language manually.
from whisperkit.
I was wrong: even with language set to a specific language, distil-whisper_distil-large-v3_594MB returns an English translation for the transcribe task.
from whisperkit.
Would you mind providing some implementation details such as the code being used? And ideally verbose logs of a case where you see this happening?
from whisperkit.
The code is the same as in the example.
I set language to 'ru' and tried to transcribe "попытка распознать речь" ("attempt to recognize speech"). This time I got a transliteration instead of a transcription (before that it was a translation). Everything works well with the 'base' model.
Log below:
[WhisperKit] Loaded audio encoder
[WhisperKit] Loading text decoder
[WhisperKit] Loaded text decoder
[WhisperKit] Loading models from /file:/private/var/containers/Bundle/Application/951645AE-11B1-46E7-BF32-D3BEE7417134/Babelfish.app/models/argmaxinc/whisperkit-coreml/distil-whisper_distil-large-v3_594MB with prewarmMode: false
[WhisperKit] Loading feature extractor
[WhisperKit] Loaded feature extractor
[WhisperKit] Loading audio encoder
[WhisperKit] Loaded audio encoder
[WhisperKit] Loading text decoder
[WhisperKit] Loaded text decoder
[WhisperKit] Loading tokenizer for large-v3
[WhisperKit] Loaded tokenizer
[WhisperKit] Loaded models for whisper size: large-v3
[WhisperKit] Current audio size: 32000 samples, most recent buffer: 1600 samples, most recent energy: (0.6009644, 0.064131714, 0.18530601, 6.959497e-06)
Debug info: transcribeAudioSamples transcribe transcribe ru
Debug info: DecodingOptions(verbose: false, task: transcribe, language: Optional("ru"), temperature: 0.0, temperatureIncrementOnFallback: 0.2, temperatureFallbackCount: 3, sampleLength: 224, topK: 5, usePrefillPrompt: false, usePrefillCache: true, skipSpecialTokens: true, withoutTimestamps: true, wordTimestamps: false, maxInitialTimestamp: nil, clipTimestamps: [0.0], promptTokens: nil, prefixTokens: nil, suppressBlank: false, supressTokens: [], compressionRatioThreshold: Optional(2.4), logProbThreshold: Optional(-1.0), firstTokenLogProbThreshold: Optional(-1.5), noSpeechThreshold: Optional(0.6))
[WhisperKit] Decoder init time: 0.007320046424865723
[WhisperKit] Prefill time: 9.5367431640625e-07
[WhisperKit] Prefill prompt: ["<|startoftranscript|>"]
[WhisperKit] Decoding Seek: 0
[WhisperKit] Decoding 0.0s - 2.8s
[WhisperKit] Decoding with tempeartures [0.0, 0.2, 0.4, 0.5996]
[WhisperKit] Decoding Temperature: 0.0
[WhisperKit] Running main loop for a maximum of 223 iterations, starting at index 0
[WhisperKit] Forcing token 50258 at index 0 from initial prompt
[WhisperKit] --------------- DECODER INPUTS DEBUG ---------------
[WhisperKit] Cache Length: 0 Input Token: 50258
[WhisperKit] Key Cache | Val Cache | Align Cache | Update Mask | Decoder Mask | Position
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 1 | 0 | 0
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 0 | -10000 | 1
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 0 | -10000 | 2
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 0 | -10000 | 3
[WhisperKit] tokenIndex: 0, token: 50259, word: <|en|>
[WhisperKit] Forcing token 50259 at index 1 from initial prompt
[WhisperKit] --------------- DECODER INPUTS DEBUG ---------------
[WhisperKit] Cache Length: 1 Input Token: 50259
[WhisperKit] Key Cache | Val Cache | Align Cache | Update Mask | Decoder Mask | Position
[WhisperKit] -0.291748 | -0.000732 | 0.000000 | 0 | 0 | 0
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 1 | 0 | 1
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 0 | -10000 | 2
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 0 | -10000 | 3
[WhisperKit] Current audio size: 64000 samples, most recent buffer: 1600 samples, most recent energy: (0.0, 0.0023147985, 0.006751795, 9.011237e-07)
[WhisperKit] tokenIndex: 1, token: 50360, word: <|transcribe|>
[WhisperKit] --------------- DECODER INPUTS DEBUG ---------------
[WhisperKit] Cache Length: 2 Input Token: 50360
[WhisperKit] Key Cache | Val Cache | Align Cache | Update Mask | Decoder Mask | Position
[WhisperKit] -0.291748 | -0.000732 | 0.000000 | 0 | 0 | 0
[WhisperKit] 0.108093 | -0.034027 | 0.000000 | 0 | 0 | 1
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 1 | 0 | 2
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 0 | -10000 | 3
[WhisperKit] tokenIndex: 2, token: 50364, word: <|notimestamps|>
[WhisperKit] --------------- DECODER INPUTS DEBUG ---------------
[WhisperKit] Cache Length: 3 Input Token: 50364
[WhisperKit] Key Cache | Val Cache | Align Cache | Update Mask | Decoder Mask | Position
[WhisperKit] -0.291748 | -0.000732 | 0.000000 | 0 | 0 | 0
[WhisperKit] 0.108093 | -0.034027 | 0.000000 | 0 | 0 | 1
[WhisperKit] 0.470703 | 0.410645 | 0.000000 | 0 | 0 | 2
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 1 | 0 | 3
[WhisperKit] tokenIndex: 3, token: 430, word: P
[WhisperKit] tokenIndex: 4, token: 12059, word: OP
[WhisperKit] tokenIndex: 5, token: 3927, word: IT
[WhisperKit] tokenIndex: 6, token: 15515, word: CA
[WhisperKit] [0.00 --> 2.80] POPITCA
[WhisperKit] ---- Transcription Timings ----
[WhisperKit] Audio Load: 0.00 ms / 1 runs ( 0.00 ms/run) 0.00%
[WhisperKit] Audio Processing: 0.62 ms / 1 runs ( 0.62 ms/run) 0.05%
[WhisperKit] Mels: 115.03 ms / 1 runs ( 115.03 ms/run) 9.63%
[WhisperKit] Encoding: 933.30 ms / 1 runs ( 933.30 ms/run) 78.10%
[WhisperKit] Matrices Init: 7.32 ms / 1 runs ( 7.32 ms/run) 0.61%
[WhisperKit] Prefill: 0.00 ms / 1 runs ( 0.00 ms/run) 0.00%
[WhisperKit] Decoding: 125.32 ms / 7 runs ( 17.90 ms/run) 10.49%
[WhisperKit] Non-inference: 10.90 ms / 7 runs ( 1.56 ms/run) 0.91%
[WhisperKit] - Logit Filtering: 0.03 ms / 7 runs ( 0.00 ms/run) 0.00%
[WhisperKit] - Sampling: 6.59 ms / 7 runs ( 0.94 ms/run) 0.55%
[WhisperKit] - Kv Caching: 3.26 ms / 7 runs ( 0.47 ms/run) 0.27%
[WhisperKit] - Word Timestamps: 0.00 ms / 0 runs ( 0.00 ms/run) 0.00%
[WhisperKit] - Windowing: 0.07 ms / 1 runs ( 0.07 ms/run) 0.01%
[WhisperKit] Fallbacks: 0.00 ms / 0 runs ( 0.00 ms/run) 0.00%
[WhisperKit] Decoding Full Loop: 1187.03 ms / 7 runs ( 169.58 ms/run) 99.33%
[WhisperKit] -------------------------------
[WhisperKit] Model Load Time: 1.63 seconds
[WhisperKit] Inference Duration: 1.19 seconds
[WhisperKit] - Decoding Loop: 1.19 seconds
[WhisperKit] Time to first token: 1.09 seconds
[WhisperKit] Total Tokens: 9
[WhisperKit] Tokens per Second: 5.90 tok/s
[WhisperKit] Real Time Factor: 0.59
[WhisperKit] Fallbacks: 0.0
[WhisperKit] [0.00 --> 2.80] POPITCA
Debug info: transcribeAudioSamples result Optional("en")
segments [WhisperKit.TranscriptionSegment(id: 0, seek: 0, start: 0.0, end: 2.8, text: " POPITCA", tokens: [50258, 50259, 50360, 50364, 430, 12059, 3927, 15515, 50257], tokenLogProbs: [[50258: 0.0], [50259: -0.06732669], [50360: -0.06732669], [50364: -9.059947e-06], [430: -9.059947e-06], [12059: -0.6916409], [3927: -0.6916409], [15515: -2.5611303], [50257: -2.5611303]], temperature: 0.0, avgLogprob: -0.73780155, compressionRatio: 0.8888889, noSpeechProb: 0.0, words: nil)]
Debug info: transcribeAudioSamples transcribe transcribe ru
Debug info: DecodingOptions(verbose: false, task: transcribe, language: Optional("ru"), temperature: 0.0, temperatureIncrementOnFallback: 0.2, temperatureFallbackCount: 3, sampleLength: 224, topK: 5, usePrefillPrompt: false, usePrefillCache: true, skipSpecialTokens: true, withoutTimestamps: true, wordTimestamps: false, maxInitialTimestamp: nil, clipTimestamps: [0.0], promptTokens: nil, prefixTokens: nil, suppressBlank: false, supressTokens: [], compressionRatioThreshold: Optional(2.4), logProbThreshold: Optional(-1.0), firstTokenLogProbThreshold: Optional(-1.5), noSpeechThreshold: Optional(0.6))
[WhisperKit] Decoder init time: 0.0032979249954223633
[WhisperKit] Prefill time: 0.0
[WhisperKit] Prefill prompt: ["<|startoftranscript|>"]
[WhisperKit] Decoding Seek: 0
[WhisperKit] Decoding 0.0s - 4.0s
[WhisperKit] Decoding with tempeartures [0.0, 0.2, 0.4, 0.5996]
[WhisperKit] Decoding Temperature: 0.0
[WhisperKit] Running main loop for a maximum of 223 iterations, starting at index 0
[WhisperKit] Forcing token 50258 at index 0 from initial prompt
[WhisperKit] --------------- DECODER INPUTS DEBUG ---------------
[WhisperKit] Cache Length: 0 Input Token: 50258
[WhisperKit] Key Cache | Val Cache | Align Cache | Update Mask | Decoder Mask | Position
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 1 | 0 | 0
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 0 | -10000 | 1
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 0 | -10000 | 2
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 0 | -10000 | 3
[WhisperKit] tokenIndex: 0, token: 50259, word: <|en|>
[WhisperKit] Forcing token 50259 at index 1 from initial prompt
[WhisperKit] --------------- DECODER INPUTS DEBUG ---------------
[WhisperKit] Cache Length: 1 Input Token: 50259
[WhisperKit] Key Cache | Val Cache | Align Cache | Update Mask | Decoder Mask | Position
[WhisperKit] -0.291748 | -0.000732 | 0.000000 | 0 | 0 | 0
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 1 | 0 | 1
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 0 | -10000 | 2
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 0 | -10000 | 3
[WhisperKit] tokenIndex: 1, token: 50360, word: <|transcribe|>
[WhisperKit] --------------- DECODER INPUTS DEBUG ---------------
[WhisperKit] Cache Length: 2 Input Token: 50360
[WhisperKit] Key Cache | Val Cache | Align Cache | Update Mask | Decoder Mask | Position
[WhisperKit] -0.291748 | -0.000732 | 0.000000 | 0 | 0 | 0
[WhisperKit] 0.108093 | -0.034027 | 0.000000 | 0 | 0 | 1
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 1 | 0 | 2
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 0 | -10000 | 3
[WhisperKit] tokenIndex: 2, token: 50364, word: <|notimestamps|>
[WhisperKit] --------------- DECODER INPUTS DEBUG ---------------
[WhisperKit] Cache Length: 3 Input Token: 50364
[WhisperKit] Key Cache | Val Cache | Align Cache | Update Mask | Decoder Mask | Position
[WhisperKit] -0.291748 | -0.000732 | 0.000000 | 0 | 0 | 0
[WhisperKit] 0.108093 | -0.034027 | 0.000000 | 0 | 0 | 1
[WhisperKit] 0.470703 | 0.410645 | 0.000000 | 0 | 0 | 2
[WhisperKit] 0.000000 | 0.000000 | 0.000000 | 1 | 0 | 3
[WhisperKit] tokenIndex: 3, token: 359, word: -
[WhisperKit] tokenIndex: 4, token: 430, word: P
[WhisperKit] tokenIndex: 5, token: 12059, word: OP
[WhisperKit] tokenIndex: 6, token: 3927, word: IT
[WhisperKit] tokenIndex: 7, token: 34, word: C
[WhisperKit] tokenIndex: 8, token: 497, word: R
[WhisperKit] tokenIndex: 9, token: 3447, word: US
[WhisperKit] tokenIndex: 10, token: 42, word: K
[WhisperKit] tokenIndex: 11, token: 497, word: R
[WhisperKit] Early stopping
[WhisperKit] [0.00 --> 4.00] - POPITC RUSK R
[WhisperKit] ---- Transcription Timings ----
[WhisperKit] Audio Load: 0.00 ms / 1 runs ( 0.00 ms/run) 0.00%
[WhisperKit] Audio Processing: 0.19 ms / 1 runs ( 0.19 ms/run) 0.02%
[WhisperKit] Mels: 7.51 ms / 1 runs ( 7.51 ms/run) 0.69%
[WhisperKit] Encoding: 906.59 ms / 1 runs ( 906.59 ms/run) 82.77%
[WhisperKit] Matrices Init: 3.30 ms / 1 runs ( 3.30 ms/run) 0.30%
[WhisperKit] Prefill: 0.00 ms / 1 runs ( 0.00 ms/run) 0.00%
[WhisperKit] Decoding: 167.03 ms / 11 runs ( 15.18 ms/run) 15.25%
[WhisperKit] Non-inference: 9.35 ms / 11 runs ( 0.85 ms/run) 0.85%
[WhisperKit] - Logit Filtering: 0.01 ms / 11 runs ( 0.00 ms/run) 0.00%
[WhisperKit] - Sampling: 4.62 ms / 11 runs ( 0.42 ms/run) 0.42%
[WhisperKit] - Kv Caching: 4.38 ms / 12 runs ( 0.37 ms/run) 0.40%
[WhisperKit] - Word Timestamps: 0.00 ms / 0 runs ( 0.00 ms/run) 0.00%
[WhisperKit] - Windowing: 0.05 ms / 1 runs ( 0.05 ms/run) 0.00%
[WhisperKit] Fallbacks: 0.00 ms / 0 runs ( 0.00 ms/run) 0.00%
[WhisperKit] Decoding Full Loop: 1092.04 ms / 11 runs ( 99.28 ms/run) 99.70%
[WhisperKit] -------------------------------
[WhisperKit] Model Load Time: 1.63 seconds
[WhisperKit] Inference Duration: 1.10 seconds
[WhisperKit] - Decoding Loop: 1.09 seconds
[WhisperKit] Time to first token: 0.93 seconds
[WhisperKit] Total Tokens: 14
[WhisperKit] Tokens per Second: 10.07 tok/s
[WhisperKit] Real Time Factor: 0.27
[WhisperKit] Fallbacks: 0.0
[WhisperKit] [0.00 --> 4.00] - POPITC RUSK R
Debug info: transcribeAudioSamples result Optional("en")
segments [WhisperKit.TranscriptionSegment(id: 0, seek: 0, start: 0.0, end: 4.0, text: " - POPITC RUSK R", tokens: [50258, 50259, 50360, 50364, 359, 430, 12059, 3927, 34, 497, 3447, 42, 497, 50257], tokenLogProbs: [[50258: 0.0], [50259: -0.10038931], [50360: -0.10038931], [50364: -1.1324947e-05], [359: -1.1324947e-05], [430: -0.70845985], [12059: -0.70845985], [3927: -2.0009668], [34: -2.0009668], [497: -0.7513129], [3447: -0.7513129], [42: -0.95198274], [497: -0.95198274], [50257: -0.2918289]], temperature: 0.0, avgLogprob: -0.66557676, compressionRatio: 1.2, noSpeechProb: 0.0, words: nil)]
[WhisperKit] Current audio size: 96000 samples, most recent buffer: 1600 samples, most recent energy: (0.07918951, 0.003477635, 0.011624122, 1.846689e-06)
[WhisperKit] Current audio size: 128000 samples, most recent buffer: 1600 samples, most recent energy: (0.11438229, 0.004325367, 0.009467858, 5.3035856e-06)
[WhisperKit] Current audio size: 160000 samples, most recent buffer: 1600 samples, most recent energy: (0.030788798, 0.002832959, 0.010625504, 5.266931e-06)
[WhisperKit] Current audio size: 192000 samples, most recent buffer: 1600 samples, most recent energy: (0.04899593, 0.0031364716, 0.009173056, 6.2020645e-06)
Debug info: loading items 20 10
receiving text from Transcriber (" - POPITC RUSK R", "russian")
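One thing that stands out in the log above is that the token decoded at tokenIndex 0 is <|en|>, even though language: Optional("ru") was requested. As a rough sketch for checking this in a captured log, the snippet below scans log lines for the word reported at a given token index. The helper name `decodedWord(atTokenIndex:in:)` and the `sampleLog` array are hypothetical and only for illustration; they are not part of WhisperKit, and the parsing assumes the "tokenIndex: N, token: T, word: W" line format shown in this log.

```swift
import Foundation

/// Hypothetical helper (not part of WhisperKit): returns the text after "word: "
/// on the first log line that reports the given tokenIndex, or nil if none is found.
func decodedWord(atTokenIndex index: Int, in logLines: [String]) -> String? {
    // The trailing comma keeps "tokenIndex: 1, " from matching "tokenIndex: 11, ".
    let marker = "tokenIndex: \(index), "
    for line in logLines where line.contains(marker) {
        if let range = line.range(of: "word: ") {
            return String(line[range.upperBound...])
        }
    }
    return nil
}

// A few lines copied from the log above.
let sampleLog = [
    "[WhisperKit] Forcing token 50258 at index 0 from initial prompt",
    "[WhisperKit] tokenIndex: 0, token: 50259, word: <|en|>",
    "[WhisperKit] tokenIndex: 1, token: 50360, word: <|transcribe|>",
]

let requestedLanguageToken = "<|ru|>"
let languageToken = decodedWord(atTokenIndex: 0, in: sampleLog)

if languageToken != requestedLanguageToken {
    print("Language mismatch: requested \(requestedLanguageToken), decoded \(languageToken ?? "none")")
}
```

A mismatch here suggests the language token was predicted by the model rather than forced from the requested language.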
from whisperkit.
@miguel-arrf @ArchieGoodwin Sorry for the delay. I can see from this log that the prefill prompt is not being used, so the decoder is not given the requested language and just tries to transcribe into English by default. In the app, make sure this setting is enabled; in the CLI, the flag is --use-prefill-prompt.
In addition, the distil model you're using is not trained to be multilingual, so it will always output English text.
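To illustrate why the prefill prompt matters, here is a minimal sketch of the standard Whisper prompt layout: start-of-transcript, language, task, and an optional no-timestamps token. The names `SketchTask` and `initialPrompt` are hypothetical and this is not WhisperKit's actual implementation; it only mirrors what the log shows, where the prefill prompt was just ["<|startoftranscript|>"] and the model then chose <|en|> on its own.

```swift
/// Hypothetical stand-in for the decoding task; not a WhisperKit type.
enum SketchTask: String {
    case transcribe
    case translate
}

/// Illustrative only: builds the special-token prefix that is forced
/// before free decoding starts.
func initialPrompt(
    language: String,
    task: SketchTask,
    withoutTimestamps: Bool,
    usePrefillPrompt: Bool
) -> [String] {
    var prompt = ["<|startoftranscript|>"]
    // Without a prefill prompt, only the start token is forced and the
    // model predicts the language and task tokens itself.
    guard usePrefillPrompt else { return prompt }
    prompt.append("<|\(language)|>")
    prompt.append("<|\(task.rawValue)|>")
    if withoutTimestamps {
        prompt.append("<|notimestamps|>")
    }
    return prompt
}

let withoutPrefill = initialPrompt(
    language: "ru", task: .transcribe, withoutTimestamps: true, usePrefillPrompt: false
)
let withPrefill = initialPrompt(
    language: "ru", task: .transcribe, withoutTimestamps: true, usePrefillPrompt: true
)

print(withoutPrefill)
print(withPrefill)
```

With the prefill prompt enabled, the requested language token is part of the forced prefix, so a multilingual model is conditioned on it instead of guessing.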
from whisperkit.
"In addition to that the distil model you're using is actually not trained to be multilingual,"
Emphasis on this: definitely use a large-v3 or small variant for non-English applications.
from whisperkit.
I see, that makes sense. Thank you.
from whisperkit.
One question still: does automatic language detection work with, for example, the large-v3 or small models? I can't make it work.
from whisperkit.
@ArchieGoodwin I've added a new decoding option in #114 called detectLanguage, which will enforce checking the language regardless of the usePrefillPrompt setting. You may also use it on its own to do a single forward pass and return the language with code like this:
let whisperKit = try await WhisperKit(
    modelFolder: tinyModelPath(),
    verbose: true,
    logLevel: .debug
)
let audioFilePath = try XCTUnwrap(
    Bundle.module.path(forResource: "ja_test_clip", ofType: "wav"),
    "Audio file not found"
)
// To detect language only, set `sampleLength` to 1 and no prefill prompt
let optionsDetectOnly = DecodingOptions(task: .transcribe, temperatureFallbackCount: 0, sampleLength: 1, detectLanguage: true)
let result = try await XCTUnwrapAsync(
    await whisperKit.transcribe(audioPath: audioFilePath, decodeOptions: optionsDetectOnly),
    "Failed to transcribe"
)
print(result.language)
from whisperkit.
This works! Thank you @ZachNagengast
from whisperkit.