
whisperkit's Introduction


WhisperKit


WhisperKit is a Swift package that integrates OpenAI's popular Whisper speech recognition model with Apple's CoreML framework for efficient, local inference on Apple devices.

Check out the demo app on TestFlight.

[Blog Post] [Python Tools Repo]

Installation

Swift Package Manager

WhisperKit can be integrated into your Swift project using the Swift Package Manager.

Prerequisites

  • macOS 14.0 or later.
  • Xcode 15.0 or later.

Steps

  1. Open your Swift project in Xcode.
  2. Navigate to File > Add Package Dependencies....
  3. Enter the package repository URL: https://github.com/argmaxinc/whisperkit.
  4. Choose the version range or specific version.
  5. Click Finish to add WhisperKit to your project.
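If you manage dependencies through a Package.swift manifest instead of the Xcode UI, the equivalent declaration looks roughly like the sketch below. The version range and platform minimums are placeholders; pin them to the WhisperKit release and deployment targets you actually use.

// swift-tools-version:5.9
// Sketch of a Package.swift manifest that depends on WhisperKit.
// The "from:" version and platform minimums are illustrative assumptions.
import PackageDescription

let package = Package(
    name: "MyTranscriber",
    platforms: [.macOS(.v14), .iOS(.v17)],
    dependencies: [
        .package(url: "https://github.com/argmaxinc/WhisperKit.git", from: "0.1.0"),
    ],
    targets: [
        .executableTarget(
            name: "MyTranscriber",
            dependencies: [.product(name: "WhisperKit", package: "WhisperKit")]
        ),
    ]
)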

Homebrew

You can install the WhisperKit command line app using Homebrew by running the following command:

brew install whisperkit-cli

Getting Started

To get started with WhisperKit, you need to initialize it in your project.

Quick Example

This example demonstrates how to transcribe a local audio file:

import WhisperKit

// Initialize WhisperKit with default settings
Task {
    let pipe = try? await WhisperKit()
    let transcription = try? await pipe!.transcribe(audioPath: "path/to/your/audio.{wav,mp3,m4a,flac}")?.text
    print(transcription)
}
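The force unwrap above keeps the example short. In practice you may prefer explicit error handling; a minimal variant, assuming the same initializer and transcribe(audioPath:) API shown above:

import WhisperKit

Task {
    do {
        // Same pipeline as the quick example, but surfacing errors instead of force-unwrapping.
        let pipe = try await WhisperKit()
        if let text = try await pipe.transcribe(audioPath: "path/to/your/audio.wav")?.text {
            print(text)
        }
    } catch {
        print("Transcription failed: \(error)")
    }
}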

Model Selection

WhisperKit automatically downloads the recommended model for the device if not specified. You can also select a specific model by passing in the model name:

let pipe = try? await WhisperKit(model: "large-v3")

This method also supports glob search, so you can use wildcards to select a model:

let pipe = try? await WhisperKit(model: "distil*large-v3")

Note that the model search must return a single model from the source repo, otherwise an error will be thrown.

For a list of available models, see our HuggingFace repo.

Generating Models

WhisperKit also comes with the supporting repo whisperkittools, which lets you create and deploy your own fine-tuned versions of Whisper in CoreML format to HuggingFace. Once generated, they can be loaded by simply changing the repo name to the one used to upload the model:

let pipe = try? await WhisperKit(model: "large-v3", modelRepo: "username/your-model-repo")

Swift CLI

The Swift CLI allows for quick testing and debugging outside of an Xcode project. To install it, run the following:

git clone https://github.com/argmaxinc/whisperkit.git
cd whisperkit

Then, set up the environment and download your desired model.

make setup
make download-model MODEL=large-v3

Note:

  1. This will download only the model specified by MODEL (see what's available in our HuggingFace repo, where we use the prefix openai_whisper-{MODEL})
  2. Before running download-model, make sure git-lfs is installed

If you would like to download all available models to your local folder, use this command instead:

make download-models

You can then run them via the CLI with:

swift run whisperkit-cli transcribe --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" --audio-path "path/to/your/audio.{wav,mp3,m4a,flac}" 

This should print a transcription of the audio file. If you would like to stream audio directly from the microphone, use:

swift run whisperkit-cli transcribe --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" --stream

Contributing & Roadmap

Our goal is to make WhisperKit better and better over time, and we'd love your help! Search the code for "TODO" to find a variety of features that are yet to be built. Please refer to our contribution guidelines for submitting issues, pull requests, and coding standards; they also include a public roadmap of features we are looking forward to building in the future.

License

WhisperKit is released under the MIT License. See LICENSE for more details.

Citation

If you use WhisperKit for something cool or just find it useful, please drop us a note at [email protected]!

If you use WhisperKit for academic work, here is the BibTeX:

@misc{whisperkit-argmax,
   title = {WhisperKit},
   author = {Argmax, Inc.},
   year = {2024},
   URL = {https://github.com/argmaxinc/WhisperKit}
}

whisperkit's People

Contributors

abhinay1997, atiorh, bharat9806, cgfarmer4, eltociear, finnvoor, jkrukowski, jordibruin, metropol, thenameless7741, zachnagengast


whisperkit's Issues

Stream with audio output

Thank you for your WORK!!!

I'm not a macOS developer, but a user. I want to know if it's possible to use the computer's audio output as the input in Stream mode, not just the microphone. The scenario is similar to simultaneous interpretation in meetings.

I look forward to your reply, thank you again!!!

Duration limit?

Does it have a duration limit? I remember that Whisper limits the input file to 30 seconds, but when I tested it on macOS, the app could handle much longer duration audio files. Do you have to chunk the audio files before transcription?

No Speech Detection

This can be done with logit filters on the first loop, similar to detecting language. However, this cannot be used when we are using a prefill prompt (i.e. forced decoder tokens) so that will need special handling. Ideally, there'd be an option to ignore the prefill prompt for the first decoder loop to detect no speech, which costs 1 extra loop but may allow skipping the entire window if developers are expecting some long stretches of silence in their input audio.

References

Openai implementation: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L692-L693
WhisperKit inline todo:

noSpeechProb: 0, // TODO: implement no speech prob

if let threshold = options.noSpeechThreshold,
   result.noSpeechProb > threshold
{
    needsFallback = false // silence
}

Dependency issue with v0.2.0

Seems like I cannot resolve the packages correctly with 0.2.0:

swift package update
Updating https://github.com/apple/swift-argument-parser
Updating https://github.com/argmaxinc/whisperkit
Updated https://github.com/argmaxinc/whisperkit (0.43s)
Updated https://github.com/apple/swift-argument-parser (0.43s)
Computing version for https://github.com/argmaxinc/whisperkit
error: Dependencies could not be resolved because root depends on 'whisperkit' 0.2.0..<1.0.0.
'whisperkit' >= 0.2.0 cannot be used because no versions of 'whisperkit' match the requirement 0.2.1..<1.0.0 and package 'whisperkit' is required using a stable-version but 'whisperkit' depends on an unstable-version package 'swift-transformers'.

The doc on SPM dependencies says:

packages which use commit-based dependency requirements can't be added as dependencies to packages that use version-based dependency requirements

Streaming Emulation for Files

Needed for benchmarking the streaming functionality, as well as generally testing its accuracy and performance. A simple loop can be made to read a file in incremental n-second chunks, where the audio length increases by n seconds each loop, and the transcription is appended as the audio size increases.
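A minimal sketch of such a loop is below. It assumes a transcribe(audioArray:)-style entry point and that the file has already been decoded to 16 kHz mono [Float] samples; treat both the API shape and return type as assumptions rather than the final benchmarking harness.

import WhisperKit

// Sketch only: emulate streaming by growing the visible audio window by `chunkSeconds`
// each iteration and re-transcribing the window so far.
// `samples` is assumed to be 16 kHz mono floats; the transcribe return type is assumed optional.
func emulateStreaming(pipe: WhisperKit, samples: [Float], chunkSeconds: Int = 1) async throws {
    let samplesPerChunk = chunkSeconds * Int(WhisperKit.sampleRate)
    var end = samplesPerChunk
    while end <= samples.count {
        let window = Array(samples[0..<end])
        let result = try await pipe.transcribe(audioArray: window)
        print("after \(end / Int(WhisperKit.sampleRate))s: \(result?.text ?? "")")
        end += samplesPerChunk
    }
}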

MLX

Hey,
Apple just dropped MLX-Swift, a cross-platform (currently only iOS/macOS) MLX framework. Are there any plans to support it?
Thanks!

show benchmarks

The advantage of this project is that it uses CoreML for a performance gain, so publishing benchmarks would show how large that advantage actually is.

Using locally saved models

Hey! Thanks for making WhisperKit!

I hope I did not miss it in the documentation, but is it possible to provide a local URL to the model for WhisperKit instead of relying on its internal mechanism to load the model? Inside my app I already have a nice UI that lets users download, suspend, and cancel downloads, so it would be nice if I could then feed WhisperKit the local URL.

If there is no such functionality but you are considering adding it - I might try to help by making a PR.

Thanks.
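For reference, the initializer exposes a modelFolder-style path override (also mentioned in the "Avoid requiring an internet connection" issue further down this page). A hedged sketch of what that could look like, with the exact parameter name and signature depending on your WhisperKit version:

import WhisperKit

// Sketch (signature may vary by version): point WhisperKit at a model folder
// that your own downloader manages instead of letting it fetch models itself.
let localModelFolder = "/path/to/Models/whisperkit-coreml/openai_whisper-base"
let pipe = try? await WhisperKit(modelFolder: localModelFolder)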

ReactNative Swift APIs

It would be worth adding support for React Native apps using Native Modules and exposing Swift APIs to JS.

Want to use AVCaptureSession buffers instead of AVAudioEngine

Hey there!

First off, thanks so much for building this awesome library! It's a total pleasure to use and works great. Looking forward to the Metal update. In the meantime, I was curious if you all would accept a PR to allow for AVCaptureSession to be used in the AudioProcessor class instead of AVAudioEngine.

I was thinking of creating a way to pass in a new setupEngine function that allowed for the captureOutput delegate to be used in place of the installTap function. The reason I want to do this is it makes it easier to change the microphone in app instead of relying on the system default.

  1. Would it make sense to allow for this in the AudioProcessor? If so, I'm happy to come up with a clean interface proposal.
  2. If not, perhaps there's a way to override the AudioProcessor class and provide an alternate setupEngine function?
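For context, a rough sketch of the AVCaptureSession side of this proposal is below. It is not WhisperKit's actual AudioProcessor API; device selection and conversion of the sample buffers to 16 kHz mono floats are omitted.

import AVFoundation

// Sketch: capture audio via AVCaptureSession so a specific microphone can be selected,
// instead of AVAudioEngine's installTap. Buffer conversion is left to the caller.
final class CaptureSessionSource: NSObject, AVCaptureAudioDataOutputSampleBufferDelegate {
    private let session = AVCaptureSession()
    private let output = AVCaptureAudioDataOutput()
    var onSampleBuffer: ((CMSampleBuffer) -> Void)?

    func start(device: AVCaptureDevice) throws {
        let input = try AVCaptureDeviceInput(device: device)
        if session.canAddInput(input) { session.addInput(input) }
        output.setSampleBufferDelegate(self, queue: DispatchQueue(label: "audio.capture"))
        if session.canAddOutput(output) { session.addOutput(output) }
        // Ideally started off the main thread in real code.
        session.startRunning()
    }

    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // Hand raw sample buffers to the caller, which would resample them for transcription.
        onSampleBuffer?(sampleBuffer)
    }
}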

Unable to load models

Hey guys! This looks great, unfortunately I'm having issues loading the models (both in my own code and the sample app)

I'm running this on an M1 Macbook Pro.

Many of the models don't load at all, even when given enough time (the progress bar usually gets stuck around specialization)

I've also tried downloading the models and using them manually, but I'm having trouble loading them that way too.

Failed to read model package at file:///Users/puravmanot/Developer/Projects/WhisperTesting/WhisperTesting/whisper_large_v3_turbo. Error: A valid manifest does not exist at path: /Users/puravmanot/Developer/Projects/WhisperTesting/WhisperTesting/whisper_large_v3_turbo/Manifest.json

It also gets stuck sometimes while loading a pre-downloaded model

[WhisperKit] Loading models...

Streaming Microphone for CLI

The CLI executable should be able to stream directly from the microphone, similar to the WhisperAX example app. This enables use cases outside of an Xcode project.

Reference

WhisperAX streaming code:

// MARK: Streaming Logic

func realtimeLoop() {
    transcriptionTask = Task {
        while isRecording && isTranscribing {
            do {
                try await transcribeCurrentBuffer()
            } catch {
                print("Error: \(error.localizedDescription)")
                break
            }
        }
    }
}

func stopRealtimeTranscription() {
    isTranscribing = false
    transcriptionTask?.cancel()
}

func transcribeCurrentBuffer() async throws {
    guard let whisperKit = whisperKit else { return }

    // Retrieve the current audio buffer from the audio processor
    let currentBuffer = whisperKit.audioProcessor.audioSamples

    // Calculate the size and duration of the next buffer segment
    let nextBufferSize = currentBuffer.count - lastBufferSize
    let nextBufferSeconds = Float(nextBufferSize) / Float(WhisperKit.sampleRate)

    // Only run the transcribe if the next buffer has at least 1 second of audio
    guard nextBufferSeconds > 1 else {
        await MainActor.run {
            if currentText == "" {
                currentText = "Waiting for speech..."
            }
        }
        try await Task.sleep(nanoseconds: 100_000_000) // sleep for 100ms for next buffer
        return
    }

    if useVAD {
        // Retrieve the current relative energy values from the audio processor
        let currentRelativeEnergy = whisperKit.audioProcessor.relativeEnergy

        // Calculate the number of energy values to consider based on the duration of the next buffer
        // Each energy value corresponds to 1 buffer length (100ms of audio), hence we divide by 0.1
        let energyValuesToConsider = Int(nextBufferSeconds / 0.1)

        // Extract the relevant portion of energy values from the currentRelativeEnergy array
        let nextBufferEnergies = currentRelativeEnergy.suffix(energyValuesToConsider)

        // Determine the number of energy values to check for voice presence
        // Considering up to the last 1 second of audio, which translates to 10 energy values
        let numberOfValuesToCheck = max(10, nextBufferEnergies.count - 10)

        // Check if any of the energy values in the considered range exceed the silence threshold
        // This indicates the presence of voice in the buffer
        let voiceDetected = nextBufferEnergies.prefix(numberOfValuesToCheck).contains { $0 > Float(silenceThreshold) }

        // Only run the transcribe if the next buffer has voice
        guard voiceDetected else {
            await MainActor.run {
                if currentText == "" {
                    currentText = "Waiting for speech..."
                }
            }

            // if nextBufferSeconds > 30 {
            //     // This is a completely silent segment of 30s, so we can purge the audio and confirm anything pending
            //     lastConfirmedSegmentEndSeconds = 0
            //     whisperKit.audioProcessor.purgeAudioSamples(keepingLast: 2 * WhisperKit.sampleRate) // keep last 2s to include VAD overlap
            //     currentBuffer = whisperKit.audioProcessor.audioSamples
            //     lastBufferSize = 0
            //     confirmedSegments.append(contentsOf: unconfirmedSegments)
            //     unconfirmedSegments = []
            // }

            // Sleep for 100ms and check the next buffer
            try await Task.sleep(nanoseconds: 100_000_000)
            return
        }
    }

    // Run transcribe
    lastBufferSize = currentBuffer.count

    let transcription = try await transcribeAudioSamples(Array(currentBuffer))

    // We need to run this next part on the main thread
    await MainActor.run {
        currentText = ""
        unconfirmedText = []
        guard let segments = transcription?.segments else {
            return
        }

        self.tokensPerSecond = transcription?.timings?.tokensPerSecond ?? 0
        self.realTimeFactor = transcription?.timings?.realTimeFactor ?? 0
        self.firstTokenTime = transcription?.timings?.firstTokenTime ?? 0
        self.pipelineStart = transcription?.timings?.pipelineStart ?? 0
        self.currentLag = transcription?.timings?.decodingLoop ?? 0

        // Logic for moving segments to confirmedSegments
        if segments.count > requiredSegmentsForConfirmation {
            // Calculate the number of segments to confirm
            let numberOfSegmentsToConfirm = segments.count - requiredSegmentsForConfirmation

            // Confirm the required number of segments
            let confirmedSegmentsArray = Array(segments.prefix(numberOfSegmentsToConfirm))
            let remainingSegments = Array(segments.suffix(requiredSegmentsForConfirmation))

            // Update lastConfirmedSegmentEnd based on the last confirmed segment
            if let lastConfirmedSegment = confirmedSegmentsArray.last, lastConfirmedSegment.end > lastConfirmedSegmentEndSeconds {
                lastConfirmedSegmentEndSeconds = lastConfirmedSegment.end

                // Add confirmed segments to the confirmedSegments array
                if !self.confirmedSegments.contains(confirmedSegmentsArray) {
                    self.confirmedSegments.append(contentsOf: confirmedSegmentsArray)
                }
            }

            // Update transcriptions to reflect the remaining segments
            self.unconfirmedSegments = remainingSegments
        } else {
            // Handle the case where segments are fewer or equal to required
            self.unconfirmedSegments = segments
        }
    }
}

Index out of range error in TextDecoder

Occasionally I'm seeing an index-out-of-range crash on segmentLogProbs[index] after a long period of silence. https://github.com/argmaxinc/WhisperKit/blob/main/Sources/WhisperKit/Core/TextDecoder.swift#L518-L521

Swift/ContiguousArrayBuffer.swift:600: Fatal error: Index out of range

Two ways I could see guarding against this:

  1. Use Swift's zip:

for (token, logProb) in zip(segmentTokens, segmentLogProbs) {
    tokenProbs.append([token: logProb])
}

  2. Check the index against the segmentLogProbs count:

for (index, token) in segmentTokens.enumerated() {
    if index < segmentLogProbs.count {
        tokenProbs.append([token: segmentLogProbs[index]])
    }
}

Happy to PR either one, but unsure if I'm missing a reason for this being as-is.

After Steps, I can't start my project.

  1. I created a new project in xcode, named WhisperKit.
  2. I added WhisperKit according to the steps.
  3. I added the following code to WhisperKit/WhisperKit/WhisperKitApp
import SwiftUI
import WhisperKit

@main
struct WhisperKitApp: App {
    init() {
        Task {
            do {
                let pipe = try? await WhisperKit()
                let transcription = try? await pipe!.transcribe(audioPath: "Audio/output-lang.wav")?.text
                print(transcription)
            } catch {
                print("Error: \(error)")
            }
        }
    }
    
    var body: some Scene {
        WindowGroup {
            ContentView()
        }
    }
}

Then I get an error : Cannot call value of non-function type 'module<WhisperKit>'

What should I do to solve this problem? tks.

Support with older swift version

Sadly, I'm having problems being able to develop and run with this.

I am running an AMD CPU Windows 11 PC. I am using VMware to get macOS, however I am not able to run any macOS version after 12, because my AMD CPU does not support it. This in turn means that I cannot run the later versions of Xcode that support Swift 5.9.

Would you ever consider backporting some of this functionality to previous versions of Swift?

Crash when starting WhisperKit on macOS


error message

Could not launch “WhisperAX”
Domain: IDELaunchErrorDomain
Code: 20
Recovery Suggestion: Runningboard has returned error 5. Please check the system logs for the underlying cause of the error.
User Info: {
DVTErrorCreationDateKey = "2024-03-09 13:07:42 +0000";
DVTRadarComponentKey = 968756;
IDERunOperationFailingWorker = IDELaunchServicesLauncher;
}

The operation couldn’t be completed. Launch failed.
Domain: RBSRequestErrorDomain
Code: 5
Failure Reason: Launch failed.

Launchd job spawn failed
Domain: NSPOSIXErrorDomain
Code: 162

Event Metadata: com.apple.dt.IDERunOperationWorkerFinished : {
"device_model" = "Mac15,6";
"device_osBuild" = "14.3.1 (23D60)";
"device_platform" = "com.apple.platform.macosx";
"dvt_coredevice_version" = "355.7.7";
"dvt_mobiledevice_version" = "1643.60.2";
"launchSession_schemeCommand" = Run;
"launchSession_state" = 1;
"launchSession_targetArch" = arm64;
"operation_duration_ms" = 22;
"operation_errorCode" = 20;
"operation_errorDomain" = IDELaunchErrorDomain;
"operation_errorWorker" = IDELaunchServicesLauncher;
"operation_name" = IDERunOperationWorkerGroup;
"param_debugger_attachToExtensions" = 0;
"param_debugger_attachToXPC" = 1;
"param_debugger_type" = 3;
"param_destination_isProxy" = 0;
"param_destination_platform" = "com.apple.platform.macosx";
"param_diag_MainThreadChecker_stopOnIssue" = 0;
"param_diag_MallocStackLogging_enableDuringAttach" = 0;
"param_diag_MallocStackLogging_enableForXPC" = 1;
"param_diag_allowLocationSimulation" = 1;
"param_diag_checker_tpc_enable" = 1;
"param_diag_gpu_frameCapture_enable" = 0;
"param_diag_gpu_shaderValidation_enable" = 0;
"param_diag_gpu_validation_enable" = 0;
"param_diag_memoryGraphOnResourceException" = 0;
"param_diag_queueDebugging_enable" = 1;
"param_diag_runtimeProfile_generate" = 0;
"param_diag_sanitizer_asan_enable" = 0;
"param_diag_sanitizer_tsan_enable" = 0;
"param_diag_sanitizer_tsan_stopOnIssue" = 0;
"param_diag_sanitizer_ubsan_stopOnIssue" = 0;
"param_diag_showNonLocalizedStrings" = 0;
"param_diag_viewDebugging_enabled" = 1;
"param_diag_viewDebugging_insertDylibOnLaunch" = 1;
"param_install_style" = 0;
"param_launcher_UID" = 2;
"param_launcher_allowDeviceSensorReplayData" = 0;
"param_launcher_kind" = 0;
"param_launcher_style" = 99;
"param_launcher_substyle" = 8192;
"param_runnable_appExtensionHostRunMode" = 0;
"param_runnable_productType" = "com.apple.product-type.application";
"param_structuredConsoleMode" = 1;
"param_testing_launchedForTesting" = 0;
"param_testing_suppressSimulatorApp" = 0;
"param_testing_usingCLI" = 0;
"sdk_canonicalName" = "macosx14.2";
"sdk_osVersion" = "14.2";
"sdk_variant" = macos;
}

System Information

macOS Version 14.3.1 (Build 23D60)
Xcode 15.2 (22503) (Build 15C500b)
Timestamp: 2024-03-09T21:07:42+08:00

Resample audio file in chunks to reduce memory usage

let newFrameLength = Int64((sampleRate / audioFile.fileFormat.sampleRate) * Double(audioFile.length))
let outputFormat = AVAudioFormat(standardFormatWithSampleRate: sampleRate, channels: channelCount)!
guard let converter = AVAudioConverter(from: audioFile.processingFormat, to: outputFormat) else {
    Logging.error("Failed to create audio converter")
    return nil
}
let frameCount = AVAudioFrameCount(audioFile.length)
guard let inputBuffer = AVAudioPCMBuffer(pcmFormat: audioFile.processingFormat, frameCapacity: frameCount),
      let outputBuffer = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: AVAudioFrameCount(newFrameLength))
else {
    Logging.error("Unable to create buffers, likely due to unsupported file format")
    return nil
}
do {
    try audioFile.read(into: inputBuffer, frameCount: frameCount)
} catch {
    Logging.error("Error reading audio file: \(error)")
    return nil
}

Creating an AVAudioPCMBuffer for the whole input audio buffer can easily surpass iOS memory limits.

Attempting to transcribe a 44100hz, 2 channel, ~1hr long video crashes on iOS due to running out of memory. It would be nice if instead of reading all the input audio into a buffer at once and converting, the audio was read and converted in chunks to reduce the memory usage.

Another less common issue that would be solved by chunking the audio is that AVAudioPCMBuffer has a max size of UInt32.max, which can be hit when transcribing a 1-2hr, 16 channel, 44100hz audio file. This is a fairly typical audio file for a podcast recorded with a RODECaster Pro.
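For illustration, one possible chunked approach with AVAudioConverter is sketched below. It is not WhisperKit's implementation; chunk sizes, end-of-stream handling, and error handling are simplified.

import AVFoundation

// Sketch: read and resample an audio file in fixed-size chunks instead of one whole-file buffer,
// so peak memory stays bounded regardless of file length.
func resampleInChunks(audioFile: AVAudioFile,
                      to outputFormat: AVAudioFormat,
                      chunkFrames: AVAudioFrameCount = 1024 * 64,
                      handleChunk: (AVAudioPCMBuffer) -> Void) throws {
    guard let converter = AVAudioConverter(from: audioFile.processingFormat, to: outputFormat) else {
        throw NSError(domain: "Resample", code: -1)
    }
    let ratio = outputFormat.sampleRate / audioFile.processingFormat.sampleRate

    while audioFile.framePosition < audioFile.length {
        // Read the next chunk of source frames; the last chunk may be shorter.
        let inputBuffer = AVAudioPCMBuffer(pcmFormat: audioFile.processingFormat, frameCapacity: chunkFrames)!
        try audioFile.read(into: inputBuffer, frameCount: chunkFrames)
        if inputBuffer.frameLength == 0 { break }

        let outCapacity = AVAudioFrameCount(Double(inputBuffer.frameLength) * ratio) + 1
        let outputBuffer = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: outCapacity)!

        // Feed exactly one input buffer per convert call, then signal no more data for now.
        var consumed = false
        var conversionError: NSError?
        _ = converter.convert(to: outputBuffer, error: &conversionError) { _, outStatus in
            if consumed {
                outStatus.pointee = .noDataNow
                return nil
            }
            consumed = true
            outStatus.pointee = .haveData
            return inputBuffer
        }
        if let conversionError = conversionError { throw conversionError }

        handleChunk(outputBuffer)
    }
}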

Benchmark for WhisperAX & CLI

It would be great to start collecting reproducible performance benchmarks for supported hardware (e.g. A14+ and M1+). This should be a self-contained function that uses openai/whisper-base by default and optionally other versions that the benchmark submitter selects. Benchmarks should run on a standard set of audio files and reports should be in a digestible and shareable format:

Pseudo-code may look like this:

  1. Detect current hardware and load the models that the user has chosen to benchmark (single, multiple, or all available models)
  2. Download standard audio files from Hugging Face (jfk.wav for short-form, ted_60.wav and a sample clip from earnings22 for long-form transcriptions)
  3. Generate the transcriptions over several iterations and tabulate runtime statistics.
    • Runs in streaming and file-based "offline" mode - this will require streaming emulation
    • Completes short-form bench and presents results before moving to long-form bench which can potentially take several minutes to complete
    • Will want to track: time to first token, RTF, inference timings (for encoder and decoder), total pipeline timings (model load -> transcription result)
  4. Export these into a markdown table with relevant device info, and current commit hash, which can be posted to GitHub for public tracking
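As a sketch of step 4, the timing fields already referenced in the WhisperAX streaming example earlier on this page (firstTokenTime, pipelineStart, realTimeFactor, tokensPerSecond) could be folded into a markdown row. Treat the TranscriptionTimings type and its field semantics as assumptions that may differ across WhisperKit versions.

import Foundation
import WhisperKit

// Sketch: one markdown table row per benchmark run. Field names mirror the timings used in the
// streaming example above; verify them against the WhisperKit version you benchmark.
func benchmarkRow(device: String, model: String, commit: String, timings: TranscriptionTimings?) -> String {
    // Assumes firstTokenTime and pipelineStart are timestamps, so their difference is time-to-first-token.
    let ttft = (timings?.firstTokenTime ?? 0) - (timings?.pipelineStart ?? 0)
    let rtf = timings?.realTimeFactor ?? 0
    let tps = timings?.tokensPerSecond ?? 0
    return "| \(device) | \(model) | \(commit) | "
        + String(format: "%.2f s | %.3f | %.1f tok/s |", ttft, rtf, tps)
}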

References

Open ASR leaderboard benchmarks: https://github.com/huggingface/open_asr_leaderboard
Nice script for collecting environment info: https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py

Related Issue

#5

Add `brew install trash` to `make setup` script

trash, used in the following make rule, is not part of a default macOS setup.

clean-package-caches:
	@trash ~/Library/Caches/org.swift.swiftpm/repositories
	@trash ~/Library/Developer/Xcode/DerivedData

I see three options to address this:

  1. use rm instead, and delete immediately
  2. use mv, and move the files to the user's trash in ~/.Trash (only works properly if the files are in the local disk; for external hard drives trashes are at /Volumes/NAME_OF_EXTERNAL/.Trashes/USER_ID/, and to handle these cases probably better go with option 3)
  3. install trash using Homebrew in the setup rule.

Originally posted by @metropol in #47

download model failed

How to fix this issue?

Task <60D6EF47-1009-4EFE-9E1B-5988A7FD6E4F>.<1> HTTP load failed, 0/0 bytes (error code: -1200 [3:-9816])
Task <60D6EF47-1009-4EFE-9E1B-5988A7FD6E4F>.<1> finished with error [-1200] Error Domain=NSURLErrorDomain Code=-1200 "An SSL error has occurred and a secure connection to the server cannot be made." UserInfo={NSErrorFailingURLStringKey=https://cdn-lfs-us-1.huggingface.co/repos/8f/fc/8ffc19694b8dfd29ebaafed41040596f15c2a6ee94d3e9f8a0bf0f1523bade3c/6ac1227740ecc2fd7a03df50ac6e2a7f7946acfa77069cf2c486ae0255356b95?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27coremldata.bin%3B+filename%3D%22coremldata.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1710121327&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMDEyMTMyN319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzhmL2ZjLzhmZmMxOTY5NGI4ZGZkMjllYmFhZmVkNDEwNDA1OTZmMTVjMmE2ZWU5NGQzZTlmOGEwYmYwZjE1MjNiYWRlM2MvNmFjMTIyNzc0MGVjYzJmZDdhMDNkZjUwYWM2ZTJhN2Y3OTQ2YWNmYTc3MDY5Y2YyYzQ4NmFlMDI1NTM1NmI5NT9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=TR-WYiW9gDlLkJIYv-2TaU4UYLNidoOb9oE-OXvBkpsmBHYZ7%7ElhzAoGKa7aqBYGcUnDmmJG0HTJXVyz-6dYbX%7E6vlU8j3x83mJfi2DEPRKzW1RB0tjRx4HMOpuP1G5FMr9CWBvS8M-icXoz-Beyu%7EmyDcLzKISUPV-RFlw1Jm72PiLb5MvCpdw2cdlDfFYUbmzYYIyWsUZsK5YuB6R187AXqM00lIy05xzIOhmuwJzL1XSMzu5-D2WxnNfkBDP4NUiX6OtYhZgJVA9I2ELqmHhOs4qX6HNAXOkxz6KtnuWEpO3N8%7E-yZ%7EPPeNcOudyuAMKw1m2qp0L8JuUxhqCP8Q__&Key-Pair-Id=KCD77M1F0VK2B, NSLocalizedRecoverySuggestion=Would you like to connect to the server anyway?, _kCFStreamErrorDomainKey=3, _NSURLErrorFailingURLSessionTaskErrorKey=LocalDownloadTask <60D6EF47-1009-4EFE-9E1B-5988A7FD6E4F>.<1>, _NSURLErrorRelatedURLSessionTaskErrorKey=(
"LocalDownloadTask <60D6EF47-1009-4EFE-9E1B-5988A7FD6E4F>.<1>"
), NSLocalizedDescription=An SSL error has occurred and a secure connection to the server cannot be made., NSErrorFailingURLKey=https://cdn-lfs-us-1.huggingface.co/repos/8f/fc/8ffc19694b8dfd29ebaafed41040596f15c2a6ee94d3e9f8a0bf0f1523bade3c/6ac1227740ecc2fd7a03df50ac6e2a7f7946acfa77069cf2c486ae0255356b95?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27coremldata.bin%3B+filename%3D%22coremldata.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1710121327&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMDEyMTMyN319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzhmL2ZjLzhmZmMxOTY5NGI4ZGZkMjllYmFhZmVkNDEwNDA1OTZmMTVjMmE2ZWU5NGQzZTlmOGEwYmYwZjE1MjNiYWRlM2MvNmFjMTIyNzc0MGVjYzJmZDdhMDNkZjUwYWM2ZTJhN2Y3OTQ2YWNmYTc3MDY5Y2YyYzQ4NmFlMDI1NTM1NmI5NT9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=TR-WYiW9gDlLkJIYv-2TaU4UYLNidoOb9oE-OXvBkpsmBHYZ7%7ElhzAoGKa7aqBYGcUnDmmJG0HTJXVyz-6dYbX%7E6vlU8j3x83mJfi2DEPRKzW1RB0tjRx4HMOpuP1G5FMr9CWBvS8M-icXoz-Beyu%7EmyDcLzKISUPV-RFlw1Jm72PiLb5MvCpdw2cdlDfFYUbmzYYIyWsUZsK5YuB6R187AXqM00lIy05xzIOhmuwJzL1XSMzu5-D2WxnNfkBDP4NUiX6OtYhZgJVA9I2ELqmHhOs4qX6HNAXOkxz6KtnuWEpO3N8%7E-yZ%7EPPeNcOudyuAMKw1m2qp0L8JuUxhqCP8Q__&Key-Pair-Id=KCD77M1F0VK2B, NSUnderlyingError=0x600000c7edf0 {Error Domain=kCFErrorDomainCFNetwork Code=-1200 "(null)" UserInfo={_kCFStreamPropertySSLClientCertificateState=0, _kCFNetworkCFStreamSSLErrorOriginalValue=-9816, _kCFStreamErrorDomainKey=3, _kCFStreamErrorCodeKey=-9816, _NSURLErrorNWPathKey=satisfied (Path is satisfied), interface: en0}}, _kCFStreamErrorCodeKey=-9816}
WhisperKit/WhisperKit.swift:194: Fatal error: Unexpectedly found nil while unwrapping an Optional value

Add support for macOS Ventura (13.0)

From what I understood there are some limitations and degradations to the model quality but it would still be nice to be able to support users on Ventura (and iOS 16)

Unable to delete the model

It looks like the model is deleted when I use the FileManager removeItem(at:) method, but when I re-run the project the deleted model appears again.

FileManager.default.removeItem(at: URL(string: "file://" + "\(path)")!)

Reduce redundant decoder forward passes by leveraging word-level timestamps

The goal is to leverage the high-quality word-level timestamps added in #38 as anchors to reliably seek the audio buffer forward at a higher frequency compared to current behavior:

  • Current behavior is to seek the audio forward if <|endoftext|> is generated or max_tokens tokens are generated.
  • Current behavior results in wasteful compute because each text token is re-decoded until the audio seeks beyond them.
  • This is up to 29 times redundant (worst case) for a 1 second audio refresh rate and a 30 second audio window for Whisper.

Some timing tokens are included in word timestamps

When filtering out special tokens in addWordTimestamps, word timings that contain a timing token followed by a hyphen aren't filtered out correctly. WordTiming.tokens correctly contains just [532], but WordTiming.word is "<|0.00|> -". This seems to occur most when multiple people are talking over each other in a recording, I guess it's Whisper's way of trying to label speakers.

Crash when starting WhisperKit in the iOS simulator or Vision Pro simulator

I am getting this error when trying to start WhisperKit in any simulator. Can someone say what it could be and how to fix it?

*** Terminating app due to uncaught exception 'com.apple.coreaudio.avfaudio', reason: 'required condition is false: IsFormatSampleRateAndChannelCountValid(format)'
*** First throw call stack:
(
0 CoreFoundation 0x00000001804bceec exceptionPreprocess + 172
1 libobjc.A.dylib 0x0000000180087068 objc_exception_throw + 56
2 CoreFoundation 0x00000001804bcd90 +[NSException raise:format:] + 0
3 AVFAudio 0x00000001c7789130 Z19AVAE_RaiseExceptionP8NSStringz + 48
4 AVFAudio 0x00000001c77e0b84 ZN17AUGraphNodeBaseV318CreateRecordingTapEmjP13AVAudioFormatU13block_pointerFvP16AVAudioPCMBufferP11AVAudioTimeE + 712
5 AVFAudio 0x00000001c78504d4 -[AVAudioNode installTapOnBus:bufferSize:format:block:] + 1324
6 languagelearn 0x0000000102091988 $s10WhisperKit14AudioProcessorC11setupEngine13inputDeviceIDSo07AVAudioF0CSSSg_tKF + 852
7 languagelearn 0x0000000102090c4c $s10WhisperKit14AudioProcessorC18startRecordingLive13inputDeviceID8callbackySSSg_ySaySfGcSgtKF + 224
8 languagelearn 0x0000000102090b2c $s10WhisperKit14AudioProcessorCAA0C10ProcessingA2aDP18startRecordingLive13inputDeviceID8callbackySSSg_ySaySfGcSgtKFTW + 24
9 languagelearn 0x000000010206bc04 $s13languagelearn11ContentViewV14startRecordingyySbFyyYaYbcfU_TY1
+ 372
10 languagelearn 0x0000000102077ea5 $s13languagelearn11ContentViewV14startRecordingyySbFyyYaYbcfU_TATQ0
+ 1
11 languagelearn 0x0000000102085369 $sxIeghHr_xs5Error_pIegHrzo_s8SendableRzs5NeverORs_r0_lTRTQ0
+ 1
12 languagelearn 0x00000001020873cd $sxIeghHr_xs5Error_pIegHrzo_s8SendableRzs5NeverORs_r0_lTRTATQ0
+ 1
13 libswift_Concurrency.dylib 0x000000020bfbf621 _ZL23completeTaskWithClosurePN5swift12AsyncContextEPNS_10SwiftErrorE + 1
)
libc++abi: terminating due to uncaught exception of type NSException

Publish WhisperKit CLI on Homebrew

It would be great if brew install whisperkit just works and the WhisperKit CLI target on macOS could become an out-of-the-box real-time transcription utility.

Implement memory and latency regression tests

Implement tests to transcribe long audio files (at least several minutes worth) and measure the memory and latency over time. This is to guard against memory leaks or slowdowns potentially being introduced by new PRs (e.g. #40 fixed by #56 thanks to @finnvoor!)

Avoid requiring an internet connection to transcribe

Currently when using the default WhisperKit flow of auto downloading models on transcribe, an internet connection is required even if models have already been downloaded in the past due to swift-transformers fetching the filenames here.

This is a bit limiting, as e.g. @pveugen was on a train with poor internet and couldn't transcribe audio even after downloading the model in the past (after #80 it would throw an error instead of crashing). I think we could get around this by manually downloading and specifying the path in setupModels modelFolder:, but it would be nice if there was a way to avoid this HTTP get by default.

Word level timestamps

Segment level timestamps look good, great work guys.

Are token level timestamps currently supported somehow, or on the roadmap?

Unable to load model in CLI

Hey folks! I'm trying to use the CLI, but it fails to load models:

Building for debugging...
Build complete! (0.07s)
Error: Unable to load model: file:///Users/usmanm/whisperkit/Models/whisperkit-coreml/openai_whisper-tiny/MelSpectrogram.mlmodelc/. Compile the model with Xcode or `MLModel.compileModel(at:)`.

The setup instructions seemed to have worked correctly:

➜  whisperkit git:(main) make setup
Setting up environment...
/opt/homebrew/bin/pip3
/opt/homebrew/bin/python3
Requirement already satisfied: huggingface_hub in /opt/homebrew/lib/python3.11/site-packages (0.20.3)
Requirement already satisfied: filelock in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (3.13.1)
Requirement already satisfied: fsspec>=2023.5.0 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (2023.10.0)
Requirement already satisfied: requests in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (2.31.0)
Requirement already satisfied: tqdm>=4.42.1 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (4.66.1)
Requirement already satisfied: pyyaml>=5.1 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (6.0.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (4.8.0)
Requirement already satisfied: packaging>=20.9 in /opt/homebrew/lib/python3.11/site-packages (from huggingface_hub) (23.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/homebrew/lib/python3.11/site-packages (from requests->huggingface_hub) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /opt/homebrew/lib/python3.11/site-packages (from requests->huggingface_hub) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/homebrew/lib/python3.11/site-packages (from requests->huggingface_hub) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /opt/homebrew/lib/python3.11/site-packages (from requests->huggingface_hub) (2023.7.22)
usmanm
Already logged in to Hugging Face.
➜  whisperkit git:(main) make download-models
Downloading compressed models...
Repository exists, pulling latest changes...
HEAD is now at 07ea546 Create config.json

The app “WhisperAX” has been killed by the operating system because it is using too much memory.

The app crashes after recording a few seconds of sound. It's being used on an iPhone 12 mini device that has been cold restarted, with Large-v2_1050MB.

The app “WhisperAX” has been killed by the operating system because it is using too much memory.
Domain: IDEDebugSessionErrorDomain
Code: 11
Recovery Suggestion: Use a memory profiling tool to track the process memory usage.
User Info: {
    DVTErrorCreationDateKey = "2024-03-12 18:15:07 +0000";
    IDERunOperationFailingWorker = DBGLLDBLauncher;
}
--
The app “WhisperAX” has been killed by the operating system because it is using too much memory.
Domain: IDEDebugSessionErrorDomain
Code: 11
Recovery Suggestion: Use a memory profiling tool to track the process memory usage.
User Info: {
    IDERunOperationFailingWorker = DBGLLDBLauncher;
}
--

Event Metadata: com.apple.dt.IDERunOperationWorkerFinished : {
    "device_isCoreDevice" = 1;
    "device_model" = "iPhone13,1";
    "device_osBuild" = "17.3.1 (21D61)";
    "device_platform" = "com.apple.platform.iphoneos";
    "dvt_coredevice_version" = "355.24";
    "dvt_mobiledevice_version" = "1643.100.58";
    "launchSession_schemeCommand" = Run;
    "launchSession_state" = 2;
    "launchSession_targetArch" = arm64;
    "operation_duration_ms" = 968315;
    "operation_errorCode" = 11;
    "operation_errorDomain" = IDEDebugSessionErrorDomain;
    "operation_errorWorker" = DBGLLDBLauncher;
    "operation_name" = IDERunOperationWorkerGroup;
    "param_debugger_attachToExtensions" = 0;
    "param_debugger_attachToXPC" = 1;
    "param_debugger_type" = 3;
    "param_destination_isProxy" = 0;
    "param_destination_platform" = "com.apple.platform.iphoneos";
    "param_diag_MainThreadChecker_stopOnIssue" = 0;
    "param_diag_MallocStackLogging_enableDuringAttach" = 0;
    "param_diag_MallocStackLogging_enableForXPC" = 1;
    "param_diag_allowLocationSimulation" = 1;
    "param_diag_checker_tpc_enable" = 1;
    "param_diag_gpu_frameCapture_enable" = 0;
    "param_diag_gpu_shaderValidation_enable" = 0;
    "param_diag_gpu_validation_enable" = 0;
    "param_diag_memoryGraphOnResourceException" = 0;
    "param_diag_queueDebugging_enable" = 1;
    "param_diag_runtimeProfile_generate" = 0;
    "param_diag_sanitizer_asan_enable" = 0;
    "param_diag_sanitizer_tsan_enable" = 0;
    "param_diag_sanitizer_tsan_stopOnIssue" = 0;
    "param_diag_sanitizer_ubsan_stopOnIssue" = 0;
    "param_diag_showNonLocalizedStrings" = 0;
    "param_diag_viewDebugging_enabled" = 1;
    "param_diag_viewDebugging_insertDylibOnLaunch" = 1;
    "param_install_style" = 2;
    "param_launcher_UID" = 2;
    "param_launcher_allowDeviceSensorReplayData" = 0;
    "param_launcher_kind" = 0;
    "param_launcher_style" = 99;
    "param_launcher_substyle" = 8192;
    "param_runnable_appExtensionHostRunMode" = 0;
    "param_runnable_productType" = "com.apple.product-type.application";
    "param_structuredConsoleMode" = 1;
    "param_testing_launchedForTesting" = 0;
    "param_testing_suppressSimulatorApp" = 0;
    "param_testing_usingCLI" = 0;
    "sdk_canonicalName" = "iphoneos17.4";
    "sdk_osVersion" = "17.4";
    "sdk_variant" = iphoneos;
}
--


System Information

macOS Version 14.2.1 (Build 23C71)
Xcode 15.3 (22618) (Build 15E204a)
Timestamp: 2024-03-12T11:15:07-07:00

Support for MacOS 13.0

Hi folks, just wanted to check in and ask what would be entailed in adding support for older mac versions, such as 13.0?

Language Detection

Language detection here should be fairly simple with logits filters now, it will entail a single decoder pass and sample just the language tokens. However, this cannot be used when we are using a prefill prompt (i.e. forced decoder tokens) so that will need special handling.

References

Openai implementation: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L19
WhisperKit inline todo:

Specialization takes a really long time

I'm trying the demo app on a MacBook Pro with Apple M1 Pro and 16 GB memory. The large-v3_turbo_1049MB model has been specializing for more than 30 minutes, but aned (the Apple Neural Engine daemon) is still running and using a whole performance core. Have you tested the loading time on different devices?

The example is unable to run on iPhone 11 Pro

The example is unable to run on iPhone 11 Pro. (It runs fine on a Mac M1 Max.)

[Screenshot on iPhone 11 Pro, Base model]

debug log:
[WhisperKit] --------------- DECODER INPUTS DEBUG ---------------
[WhisperKit] Cache Length: 2 Input Token: 50359
[WhisperKit] Key Cache | Val Cache | Update Mask | Decoder Mask | Position
[WhisperKit] -0.125732 | 0.048828 | 0 | 0 | 0
[WhisperKit] 0.308350 | -0.556641 | 0 | 0 | 1
[WhisperKit] 0.000000 | 0.000000 | 1 | 0 | 2
[WhisperKit] 0.000000 | 0.000000 | 0 | -10000 | 3
[WhisperKit] [0.00 --> 14.90]
[WhisperKit] ---- Transcription Timings ----
[WhisperKit] Audio Load: 0.00 ms / 1 runs ( 0.00 ms/run) 0.00%
[WhisperKit] Audio Processing: 0.41 ms / 1 runs ( 0.41 ms/run) 0.03%
[WhisperKit] Mels: 57.57 ms / 1 runs ( 57.57 ms/run) 3.96%
[WhisperKit] Encoding: 1171.59 ms / 1 runs ( 1171.59 ms/run) 80.56%
[WhisperKit] Matrices Init: 5.36 ms / 1 runs ( 5.36 ms/run) 0.37%
[WhisperKit] Prefill: 0.49 ms / 1 runs ( 0.49 ms/run) 0.03%
[WhisperKit] Decoding: 208.06 ms / 4 runs ( 52.01 ms/run) 14.31%
[WhisperKit] Non-inference: 7.49 ms / 4 runs ( 1.87 ms/run) 0.52%
[WhisperKit] - Sampling: 4.13 ms / 4 runs ( 1.03 ms/run) 0.28%
[WhisperKit] - Kv Caching: 3.91 ms / 4 runs ( 0.98 ms/run) 0.27%
[WhisperKit] - Windowing: 0.08 ms / 1 runs ( 0.08 ms/run) 0.01%
[WhisperKit] Fallbacks: 122.98 ms / 0 runs ( 0.00 ms/run) 8.46%
[WhisperKit] Decoding Full Loop: 1448.16 ms / 4 runs ( 362.04 ms/run) 99.57%
[WhisperKit] -------------------------------
[WhisperKit] Model Load Time: 6.60 seconds
[WhisperKit] Inference Duration: 1.45 seconds
[WhisperKit] - Decoding Loop: 1.45 seconds
[WhisperKit] Time to first token: 1.30 seconds
[WhisperKit] Total Tokens: 5
[WhisperKit] Tokens per Second: 2.76 tok/s
[WhisperKit] Real Time Factor: 0.10
[WhisperKit] Fallbacks: 0.0
[WhisperKit] [0.00 --> 14.90] <|endoftext|>
