
Comments (7)

ZachNagengast commented on May 23, 2024

Quick update, I've identified the issue and am putting together a patch for this now.

from whisperkit.

atiorh commented on May 23, 2024

Thanks for the report @finnvoor! We started relying on the accuracy of word timestamps in streaming mode too. This is important, so we will triage and address it.


atiorh commented on May 23, 2024

Low-hanging fruit:

  • Leverage the redundancy between segment-level and word-level timestamps for consistency checks
  • Implement median filtering in DTW, as in the original implementation, even though it didn't have a major impact in our early tests
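A minimal sketch of what these two items could look like, written in Python for illustration only and not taken from WhisperKit. The function names, the `(text, start, end)` word tuples, the 0.02 s tolerance, and the default filter width are all assumptions; the median filter here runs over a plain 1-D sequence with reflected edges, whereas a real implementation would apply it to the attention weights before DTW.

```python
from statistics import median


def median_filter_1d(values, width=7):
    """Median-filter a 1-D sequence, reflecting values at both edges.

    The default width is an assumption made for this sketch.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    pad = width // 2
    values = list(values)
    if pad == 0 or len(values) <= pad:
        return values  # nothing to smooth, or too short to pad
    left = values[pad:0:-1]          # mirrored neighbours of the first sample
    right = values[-2:-pad - 2:-1]   # mirrored neighbours of the last sample
    padded = left + values + right
    return [median(padded[i:i + width]) for i in range(len(values))]


def find_timestamp_inconsistencies(segment_start, segment_end, words, tolerance=0.02):
    """Return a list of problems for words given as (text, start, end) in seconds."""
    problems = []
    previous_end = segment_start
    for text, start, end in words:
        if end < start:
            problems.append(f"{text!r}: ends before it starts")
        if start < segment_start - tolerance or end > segment_end + tolerance:
            problems.append(f"{text!r}: outside segment bounds")
        if start < previous_end - tolerance:
            problems.append(f"{text!r}: overlaps the previous word")
        previous_end = max(previous_end, end)
    return problems
```

The consistency check only reports disagreements; deciding whether to trust the segment or the word timestamps when they conflict is a separate choice.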


atiorh commented on May 23, 2024

@finnvoor Please confirm that this fixes your issue 🙏


finnvoor commented on May 23, 2024

@ZachNagengast @atiorh gave it a quick test and the start times seem much more precise, thanks for the quick improvement.

It does seem like this has made the end times of words/segments slightly worse though. Previously, the end times would sometimes include some silence (be too late), but they never seemed to cut into the last word, so they were good for splitting after a word/sentence. Now it seems to account for silence at the end better, but it goes a bit too far and cuts off the end of the word. In the same example at ~4s, the word "gas" used to end at 4.06, now ends at 3.62, but should end at ~3.8.

[Screenshot: Logic Pro - Untitled - Tracks]

We'll continue to test it a bit more today.


ZachNagengast commented on May 23, 2024

I see, good to know. We might be able to improve this with some VAD (shift the end time to the last point at which the sound level was past a threshold). However, this is also the same endpoint that openai/whisper gives for its word timestamps, so it might be a model issue, or it might need a bit more massaging to get perfect. There are many such so-called "hacks" in the main repo that could be improved.
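As a rough illustration of the VAD idea (end the word at the last point where the sound level was past a threshold), here is a simple energy-based sketch in Python. It is not WhisperKit code: the function name, the RMS measure, the 10 ms frame size, and the 0.01 threshold are assumptions, and `latest_end` is a hypothetical upper bound for the search (for example the next word's start).

```python
import math


def last_sound_time(samples, sample_rate, start, latest_end,
                    threshold=0.01, frame_ms=10):
    """Return the end time (s) of the last frame in [start, latest_end]
    whose RMS level is at or above `threshold`.

    Falls back to `latest_end` when no frame is loud enough.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    first = max(0, int(start * sample_rate))
    last = min(len(samples), int(latest_end * sample_rate))
    result = None
    for frame_start in range(first, last, frame_len):
        frame = samples[frame_start:min(frame_start + frame_len, last)]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            # Remember the end of the most recent loud frame.
            result = (frame_start + len(frame)) / sample_rate
    return latest_end if result is None else result
```

A fixed threshold is the weakest part of this sketch; background noise or quiet trailing consonants would likely need an adaptive level.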

For detail: the reason it previously ended that far past the audio is that we were including the punctuation token ".", which has non-zero length, as part of the word's end time. The fix removed that time entirely, so it now ends exactly where the model thinks the word "gas" ends, before the punctuation. The next step may be to consider some middle ground where the punctuation counts for some time but not the full token, because it's not a spoken word. Open to ideas here too!
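One way to picture the "middle ground" is a weighted blend between the end of the spoken token and the end of the punctuation token. This is only a sketch: the function name and the `punctuation_weight` parameter are hypothetical, and the weight would need tuning on real data.

```python
def blended_word_end(spoken_end, punctuation_end, punctuation_weight=0.5):
    """Blend the word's end time between the spoken token and its punctuation.

    punctuation_weight=0.0 ignores the punctuation token entirely;
    punctuation_weight=1.0 counts its full duration.
    """
    if not 0.0 <= punctuation_weight <= 1.0:
        raise ValueError("punctuation_weight must be between 0 and 1")
    if punctuation_end < spoken_end:
        raise ValueError("punctuation_end must not precede spoken_end")
    return spoken_end + punctuation_weight * (punctuation_end - spoken_end)
```

Using the numbers from this thread, `blended_word_end(3.62, 4.06, 0.4)` lands at about 3.80. That weight fits this single example only and says nothing about other words or audio.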


finnvoor commented on May 23, 2024

Got it, figured eventually we'd run into model limits. I think in our case I'll try just adding a small offset to the end since it seems pretty consistent, and in general adding silence is better than cutting words. VAD would be really nice but sounds a bit tricky to implement.
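The fixed-offset workaround could be sketched like this in Python. The function name, the 0.15 s default, and the clamping to the next word's start and the audio duration are assumptions added for illustration, not something described in the thread.

```python
def pad_word_end(end, offset=0.15, next_word_start=None, audio_duration=None):
    """Push a word's end time later by `offset` seconds.

    The result is clamped so it does not run into the next word or past
    the end of the audio, and it is never earlier than the original end.
    """
    padded = end + offset
    if next_word_start is not None:
        padded = min(padded, next_word_start)
    if audio_duration is not None:
        padded = min(padded, audio_duration)
    return max(padded, end)
```

Clamping to the next word's start keeps the padding from reintroducing overlap between adjacent words.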

