Comments (7)
Hi @kevin01881 and thanks for the kind words.
Btw, testing on AMD CPUs I find that whisper.cpp performance is comparable to (maybe slightly faster than) the stock PyTorch implementation. Just make sure to run the PyTorch version with the greedy decoder to make the comparison fair. I don't have an Intel CPU though, so I'm not sure how it compares there.
But yeah, on M1 I think we still have a big edge - probably 2 or 3 times faster (I haven't done a proper benchmark yet). This will probably be the case until PyTorch has proper support for Arm processors.
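For reference, a comparison along those lines can be scripted from the shell. The whisper.cpp invocation matches the examples later in this thread; the openai-whisper CLI flags are my assumption of how to force the greedy decoder (passing None for beam_size/best_of) and are worth double-checking against the installed version:

# whisper.cpp, 4 threads
./main -m models/ggml-base.bin -f samples/jfk.wav -t 4

# stock PyTorch implementation with greedy decoding (assumed flags)
whisper samples/jfk.wav --model base --temperature 0 --beam_size None --best_of None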
Btw, on this note, someone reported that on M1 Max it is efficient to split the job into multiple runs with fewer threads [0].
I guess we should have a built-in option in whisper.cpp to split the job into N tasks and run multiple inferences - similar to what @ArtyomZemlyak did earlier in this thread.
[0] openai/whisper#208 (reply in thread)
> Interesting drop performance for t > 8
Yes, I've noticed that. I have two guesses (a quick scaling test is sketched below the list):
- The computation is memory-bound, so at some point increasing the number of threads does not help because the memory bandwidth is saturated
- I have a problem in my thread synchronization implementation - currently, I use "busy-waiting" on atomic variables, which (as you probably noticed) keeps the CPUs at 100% all the time. This is much faster compared to locking mutexes. However, I am not sure if it has negative side effects for a large number of threads. Needs some investigation
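One way to separate the two guesses is to time the same file at increasing thread counts and watch where the curve flattens; a minimal sketch, reusing the model and audio paths from the script later in this thread:

# time the same transcription at different thread counts
for t in 1 2 4 8 12 16; do
  echo "threads: $t"
  time ./main --language ru -t $t -m ../models/ggml-model-tiny.bin -f ../audio/cuker1.wav > /dev/null
done

If the runtime stops improving well before the core count is reached, the memory-bandwidth explanation becomes the more likely one.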
The last section (V3) is surprising - I don't expect the encode time to be different for different files, given that they are the same length. Something is not right there.
The "parallel" idea is very interesting - I never realised that we can split the file in chunks and run multiple whisper.cpp
processes in parallel. This might be a very efficient approach for multi-core systems.
Can you provide some more information about your parallel approach? How did you split the audio?
I think we have to provide an offset
argument to main
to be able to start the transcription at different start offset of the audio file.
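To illustrate the idea, assuming a hypothetical --offset-t flag taking a start offset in milliseconds (and a matching --duration flag - neither exists yet, this is just the shape the interface could take), one file could be split across two processes like this:

# transcribe the first 60 s and the remainder in parallel (hypothetical flags)
./main -m ../models/ggml-model-tiny.bin -f ../audio/cuker1.wav --offset-t 0 --duration 60000 &
./main -m ../models/ggml-model-tiny.bin -f ../audio/cuker1.wav --offset-t 60000 &
wait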
In my previous example it's just parallel jobs in a bash script:
start=$SECONDS

# pick the model to benchmark
export MODEL=tiny
# export MODEL=base
# export MODEL=small
# export MODEL=large
export THREADS=4

# launch all transcriptions in the background, one process per file
./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/cuker1.wav &
./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/cuker2.wav &
./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/cuker_frag1.wav &
./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/gokov1.wav &
./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/gokov2.wav &
./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/fragmen1t.wav &
./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/very_bad_sample.wav &

# wait for every background job to finish, then report the wall-clock time
wait
duration=$(( SECONDS - start ))
echo ""
echo "TOTAL_TIME:"
echo $duration
But if we want the same effect on real audio, we can try two approaches:
- VAD (voice activity detection): find all the chunks where voice is present.
- Split the found chunks into smaller ones (if they are long, > 30 s) and hand them to different processes.
But we need to synchronize the timings in the output: we have to remember the start offset of each chunk and add it back to the resulting timestamps. A rough sketch of the splitting step follows below.
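For the splitting step, here is a minimal sketch using ffmpeg with fixed 30 s cuts - the fixed cut is an assumption standing in for real VAD (ffmpeg's silencedetect filter could drive smarter cut points). Encoding each chunk's start offset into its filename is one way to remember the timings for the merge:

# cut input.wav into 30-second chunks named by their start offset in seconds
len=$(ffprobe -v error -show_entries format=duration -of csv=p=0 input.wav)
for ((off = 0; off < ${len%.*}; off += 30)); do
  ffmpeg -v error -ss $off -t 30 -i input.wav -c copy chunk_$off.wav
done

# transcribe all chunks in parallel; when merging, shift each chunk's
# timestamps forward by the offset encoded in its filename
for f in chunk_*.wav; do
  ./main --language ru -t 4 -m ../models/ggml-model-tiny.bin -f $f > $f.txt &
done
wait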
Or we can just run multiple whisper.cpp processes and transcribe multiple audio files at the same time - useful when we don't need the fastest recognition of a single file, but want as many audio-seconds as possible recognized per processing-hour.
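For that throughput-oriented mode, xargs can keep a fixed number of processes busy instead of launching everything at once (the -P 4 concurrency level and the paths are assumptions to adapt):

# run at most 4 transcriptions at a time over all wav files in ../audio
ls ../audio/*.wav | xargs -P 4 -I {} ./main --language ru -t 4 -m ../models/ggml-model-tiny.bin -f {}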
@ggerganov Thanks very much, sir, for making whisper.cpp!! It is pure insanity that I can run a model that requires 12 GB of VRAM on my ultra-slow PC that is pushing 8 years old (i7-5500U). You are a wizard.
This shows how poorly most of today's models are written as far as efficiency goes. It truly makes one wonder what else we could be running on CPUs that currently requires an RTX 3090 or even a T4/A100.
So far, on this ancient computer I have successfully run: Facebook Research's Demucs (stock, no optimized port), Stable Diffusion (OpenVINO port), and, thanks to your C++ port, now Whisper as well.
@ArtyomZemlyak Careful with the output you get when fragmenting audio for parallel inference jobs.
See openai/whisper#440
cc @ggerganov