The timings of segments/words are sometimes inaccurate. When the attached audio is tra

Incorrect timestamps (0.5sec off) about whisperkit HOT 7 CLOSED

finnvoor commented on May 23, 2024

Incorrect timestamps (0.5sec off)

from whisperkit.

Comments (7)

ZachNagengast commented on May 23, 2024

Quick update, I've identified the issue and am putting together a patch for this now.

atiorh commented on May 23, 2024

Thanks for the report @finnvoor! We started relying on the accuracy of word timestamps in streaming mode too. This is important, so we will triage and address it.

atiorh commented on May 23, 2024

Low-hanging fruit:

- Leverage the redundancy between segment-level and word-level timestamps for consistency checks
- Implement median filtering in DTW, as in the original implementation, even though it didn't have a major impact in our early tests
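As a rough illustration of the two ideas above, here is a minimal Python sketch. It is not WhisperKit code: the `Word` type, the function names, the 0.05 s tolerance, and the filter width are all assumptions made for the example, and the median filter uses edge padding for simplicity.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def check_segment_consistency(seg_start, seg_end, words, tolerance=0.05):
    """Compare word timings against their segment; return a list of problems."""
    problems = []
    previous_end = seg_start
    for word in words:
        if word.end < word.start:
            problems.append(f"{word.text!r}: end precedes start")
        if word.start < seg_start - tolerance or word.end > seg_end + tolerance:
            problems.append(f"{word.text!r}: outside segment bounds")
        if word.start < previous_end - tolerance:
            problems.append(f"{word.text!r}: overlaps previous word")
        previous_end = max(previous_end, word.end)
    return problems


def median_filter(values, width=7):
    """Sliding median over a 1-D list (odd width, edge-padded)."""
    if width % 2 == 0:
        raise ValueError("width must be odd")
    pad = width // 2
    padded = [values[0]] * pad + list(values) + [values[-1]] * pad
    return [median(padded[i:i + width]) for i in range(len(values))]
```

In this sketch the filter would be applied along the time axis of the alignment weights before running DTW, so that isolated spikes do not pull the alignment path off course.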

atiorh commented on May 23, 2024

@finnvoor Please confirm that this fixes your issue 🙏

finnvoor commented on May 23, 2024

@ZachNagengast @atiorh gave it a quick test and the start times seem much more precise, thanks for the quick improvement.

It does seem like this has made the end times of words/segments slightly worse though. Previously, the end times would sometimes include some silence (be too late), but they never seemed to cut into the last word, so they were good for splitting after a word/sentence. Now it seems to account for silence at the end better, but goes a bit too far and cuts off the end of the word. In the same example at ~4s, the word "gas" used to end at 4.06 and now ends at 3.62, but should end at ~3.8.

We'll continue to test it a bit more today.

ZachNagengast commented on May 23, 2024

I see, good to know. We might be able to improve this with some VAD (shift the end time to the last point where the sound level was above a threshold). That said, this is the same endpoint that openai/whisper gives for its word timestamps, so it might be a model issue, or it might need a bit more massaging to get perfect. There are many such so-called "hacks" in the main repo that could be improved.
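The VAD idea could look roughly like the following energy-threshold sketch. Everything here is an assumption for illustration: the function name, mono float samples in the range -1 to 1, the 10 ms frame size, and the 0.02 RMS threshold. A real VAD would likely need smoothing and a tuned threshold.

```python
import math


def refine_end_time(samples, sample_rate, start, latest_end,
                    threshold=0.02, frame_ms=10):
    """Return the end of the last frame in [start, latest_end] whose RMS
    reaches the threshold, or latest_end unchanged if no frame does."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    first = int(start * sample_rate)
    last = min(len(samples), int(latest_end * sample_rate))
    refined = None
    for begin in range(first, last, frame_len):
        frame = samples[begin:min(begin + frame_len, last)]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            # Remember the end of the most recent "loud" frame.
            refined = (begin + len(frame)) / sample_rate
    return refined if refined is not None else latest_end
```

Passing the pre-fix end time (the one that included trailing silence) as `latest_end` would let the search pull the boundary back to where the sound actually stops, rather than where the model places it.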

For detail: the reason it previously ended that far past the audio is that we were including the punctuation token ".", which has non-zero length, as part of the word's end time. The fix removed that time entirely, so it now ends exactly where it thinks the word "gas" ends, before the punctuation. The next step may be to consider some middle ground where the punctuation counts for some of the time but not the full token, because it's not a spoken word. Open to ideas here too!
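One way to picture that middle ground is a weight between 0 and 1 applied to the punctuation token's duration. This is only a sketch: the function name, the `(text, start, end)` tuple layout, and the 0.5 default are assumptions, not how WhisperKit merges tokens. A weight of 0 reproduces the post-fix behaviour and 1 the pre-fix behaviour.

```python
import string


def merge_punctuation(tokens, punctuation_weight=0.5):
    """Attach punctuation-only tokens to the previous word, extending that
    word's end by a fraction of the punctuation token's duration.

    tokens: list of (text, start, end) tuples, times in seconds.
    """
    words = []
    for text, start, end in tokens:
        stripped = text.strip()
        is_punct = stripped != "" and all(ch in string.punctuation for ch in stripped)
        if is_punct and words:
            prev_text, prev_start, prev_end = words[-1]
            new_end = prev_end + punctuation_weight * (end - prev_end)
            words[-1] = (prev_text + stripped, prev_start, new_end)
        else:
            words.append((text, start, end))
    return words
```

Using the end times quoted in this thread (3.62 without the punctuation, 4.06 with it) and a made-up start time of 3.2, a weight of 0.5 puts the end of "gas." at roughly 3.84:

```python
merge_punctuation([("gas", 3.2, 3.62), (".", 3.62, 4.06)])
```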

finnvoor commented on May 23, 2024

Got it, figured eventually we'd run into model limits. I think in our case I'll try just adding a small offset to the end since it seems pretty consistent, and in general adding silence is better than cutting words. VAD would be really nice but sounds a bit tricky to implement.
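A minimal sketch of the fixed-offset workaround, assuming words arrive as `(text, start, end)` tuples. The function name and the 0.15 s offset are placeholders for illustration, since no specific value is given here. The padding is capped at the next word's start and at the audio duration so that padded words never overlap or run past the end of the file.

```python
def pad_word_ends(words, offset=0.15, audio_duration=None):
    """Push each word's end later by `offset` seconds without overlapping
    the next word or exceeding the audio duration."""
    padded = []
    for index, (text, start, end) in enumerate(words):
        limit = end + offset
        if index + 1 < len(words):
            limit = min(limit, words[index + 1][1])  # next word's start
        if audio_duration is not None:
            limit = min(limit, audio_duration)
        # Never move an end time earlier than the original.
        padded.append((text, start, max(end, limit)))
    return padded
```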
