
Comments (12)

mkiol commented on June 12, 2024

Thanks for the report.

This is an error from the PulseAudio API integration added in v4.5.0.

Does your system have a PulseAudio or PipeWire server?
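
(A quick way to check which server is answering, assuming pactl is available; "pulseaudio" indicates a plain PulseAudio server, while a PipeWire system reports something like "PulseAudio (on PipeWire 0.3.x)":)

$ pactl info | grep "Server Name"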


mkiol commented on June 12, 2024

> PULSE_SERVER=/run/user/$(id -u)/pulse/native flatpak run net.mkiol.SpeechNote --verbose
> This allows me to launch dsnote; however, is there a way that dsnote can take care of this?

I'm very glad you were able to find a workaround 👍🏿 I want to better understand why this problem occurred. I've tried a fresh Debian 12 + GNOME installation and didn't have this issue. Did you make any significant changes to the audio server on your system? How can I reproduce this problem?


mkiol commented on June 12, 2024

> Ah, in the first posted log, there was one line: [D] 07:41:09.766380424.766 0x7f86fb5c8d00 () - platform: "xcb"

This xcb is just an indicator that you are on X11. Nothing to worry about.

> The GUI currently running on my workstation is the suckless tools combo dmenu+dwm+ssterm
> https://tools.suckless.org/
> Maybe that's the reason why the PA behind-the-scenes setting has not been picked up by dwm on my workstation the way it was by GNOME on yours?

Thanks for the hint. I will investigate it.

> But it does not make much sense; faster-whisper models are supposed to need less RAM than whisper models...

Actually, "Whisper" models are implemented in Speech Note with whisper.cpp library which is optimized for minimal RAM and CPU use. It is a bit confusing because of naming, but Speech Note does not use original "Whisper" implementation and models from OpenAI. It uses only optimized versions. Both "Whisper" and "Faster Whisper" are almost equally efficient.


mkiol commented on June 12, 2024

I think this issue can be closed now.


h9j6k commented on June 12, 2024

Thanks for pointing me to PulseAudio; I had no clue what "PA" meant previously.

I did a quick search on Google and came up with a hack where I have to manually do the following each time: first kill any PA processes before launching dsnote, restart PA in verbose mode, then add the env var PULSE_SERVER to the flatpak run command, i.e.:

$ pulseaudio -k
$ pulseaudio -v
$ PULSE_SERVER=/run/user/$(id -u)/pulse/native flatpak run net.mkiol.SpeechNote --verbose

This allows me to launch dsnote; however, is there a way that dsnote can take care of this?
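
(In the meantime, a wrapper script could automate the workaround; a minimal sketch, reusing the socket path from the command above. The script name dsnote-launch.sh is just an example:)

#!/bin/sh
# dsnote-launch.sh: point PULSE_SERVER at the per-user native socket
# if it exists, then launch the flatpak. Adjust SOCKET if your
# PulseAudio/PipeWire setup places the socket elsewhere.
SOCKET="/run/user/$(id -u)/pulse/native"
if [ -S "$SOCKET" ]; then
    export PULSE_SERVER="$SOCKET"
fi
exec flatpak run net.mkiol.SpeechNote "$@"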


h9j6k commented on June 12, 2024

Now there is a new situation. With Addon.Nvidia installed, if I try to transcribe an audio file with faster-whisper, it exits with "invalid task id":

[D] 13:04:34.037341885.37 0x7fcb0f586d00 () - opening file: "/home/user/Downloads/audio.m4a" -1
[D] 13:04:34.037376973.37 0x7fcb0f586d00 init_av_in_format:594 - opening file: /home/user/Downloads/audio.m4a
[D] 13:04:34.115270445.115 0x7fcb0f586d00 () - "audio=[[index=0, media-type=audio, title=, lang=und], ], video=], subtitles="
[D] 13:04:34.115322668.115 0x7fcb0f586d00 () - requested stream index for transcribe: 0
[D] 13:04:34.115347012.115 0x7fcb0f586d00 () - stt transcribe file
[D] 13:04:34.116419091.116 0x7fcb0f586d00 () - default tts model not found: "en"
[D] 13:04:34.116429944.116 0x7fcb0f586d00 () - default mnt lang not found: "en"
[D] 13:04:34.116434694.116 0x7fcb0f586d00 () - new default mnt lang: "en"
[D] 13:04:34.116442408.116 0x7fcb0f586d00 () - choosing model for id: "de_fasterwhisper_large3" "en"
[D] 13:04:34.116463339.116 0x7fcb0f586d00 () - gpu device str: ("CUDA", " 0", " Quadro K2200")
[D] 13:04:34.116481995.116 0x7fcb0f586d00 () - restart stt engine config: "lang=de, lang_code=, model-files=[model-file=/home/user/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote/speech-models/multilang_fasterwhisper_large3, scorer-file=, ttt-model-file=], speech-mode=automatic, vad-mode=aggressiveness-3, speech-started=0, text-format=subrip, options=t, use-gpu=1, gpu-device=[id=0, api=cuda, name=Quadro K2200, platform-name=], sub-config=[min-segment-dur=4, min-line-length=0, max-line-length=0]"
[D] 13:04:34.116489723.116 0x7fcb0f586d00 () - new stt engine required
[D] 13:04:34.121257343.121 0x7fcb0f586d00 start:224 - stt start
[D] 13:04:34.121346781.121 0x7fcb0f586d00 start:234 - stt start completed
[D] 13:04:34.121373437.121 0x7fcb0f586d00 () - requested stream index: 0
[D] 13:04:34.121377712.121 0x7fc95bd9e680 process:283 - stt processing started
[D] 13:04:34.121401210.121 0x7fcb0f586d00 () - creating audio source
[D] 13:04:34.121403130.121 0x7fc95bd9e680 set_state:469 - stt state: idle => initializing
[D] 13:04:34.121408682.121 0x7fc95bd9e680 set_state:476 - speech detection status: no-speech => initializing (no-speech)
[D] 13:04:34.121415379.121 0x7fcb0f586d00 decompress_to_data_raw_async:1381 - task decompress to data raw async
[D] 13:04:34.121421565.121 0x7fc95bd9e680 create_model:91 - creating fasterwhisper model
[D] 13:04:34.121433162.121 0x7fc95bd9e680 execute:55 - task pushed
[D] 13:04:34.121442775.121 0x7fcb0f586d00 init_av_in_format:594 - opening file: /home/user/Downloads/audio.m4a
[D] 13:04:34.121473424.121 0x7fcb08bd9680 loop:130 - py task execution: start
[D] 13:04:34.121862019.121 0x7fcb08bd9680 operator():101 - cpu info: arch=x86_64, cores=8
[D] 13:04:34.121883966.121 0x7fcb08bd9680 operator():103 - using threads: 8/8
[D] 13:04:34.121890705.121 0x7fcb08bd9680 operator():105 - using device: cuda 0
[D] 13:04:34.122995551.122 0x7fcb0f586d00 init_av_in_format:688 - stream index requested => selecting stream: 0
[D] 13:04:34.123004810.123 0x7fcb0f586d00 init_av:744 - input codec: aac (86018)
[D] 13:04:34.123008437.123 0x7fcb0f586d00 init_av:748 - requested out format: unknown
[D] 13:04:34.123240519.123 0x7fcb0f586d00 init_av:872 - encoder name: pcm_s16le
[D] 13:04:34.123250385.123 0x7fcb0f586d00 init_av:1035 - decoder frame-size: 2048
[D] 13:04:34.123253177.123 0x7fcb0f586d00 init_av:1038 - encoder frame-size: 0
[D] 13:04:34.123255777.123 0x7fcb0f586d00 init_av:1040 - time-base change: 1/48000 => 1/16000
[D] 13:04:34.123258380.123 0x7fcb0f586d00 init_av:1049 - sample-format change: fltp => s16
[D] 13:04:34.123260782.123 0x7fcb0f586d00 init_av:1051 - sample-rate change: 48000 => 16000
[D] 13:04:34.124531576.124 0x7fcb0f586d00 init_av_filter:394 - filter src args: sample_rate=48000:sample_fmt=fltp:time_base=1/48000:channel_layout=stereo
[D] 13:04:34.211255296.211 0x7fcb0f586d00 init_av:1110 - output format: 
[D] 13:04:34.211448642.211 0x7fc9527fc680 operator():1361 - process started
[D] 13:04:34.211639247.211 0x7fcb0f586d00 () - service refresh status, new state: transcribing-file
[D] 13:04:34.211652041.211 0x7fcb0f586d00 () - service state changed: idle => transcribing-file
[D] 13:04:34.211662457.211 0x7fcb0f586d00 () - task state changed: 0 => 3
[D] 13:04:34.211672713.211 0x7fcb0f586d00 () - import file result: ok-import-audio
[D] 13:04:34.211929764.211 0x7fcb0f586d00 () - service refresh status, new state: transcribing-file
[D] 13:04:34.211948028.211 0x7fcb0f586d00 () - transcribe progress: 0
[D] 13:04:34.211957016.211 0x7fcb0f586d00 () - app current task: -1 => 0
[D] 13:04:34.211964904.211 0x7fcb0f586d00 () - app task state: idle => initializing
[D] 13:04:34.212564529.212 0x7fcb0f586d00 () - app service state: idle => transcribing-file
[W] 13:04:34.217262705.217 0x7fcb0f586d00 () - no available mnt langs
[W] 13:04:34.217283144.217 0x7fcb0f586d00 () - no available mnt out langs
[W] 13:04:34.217287436.217 0x7fcb0f586d00 () - no available tts models for in mnt
[W] 13:04:34.217290581.217 0x7fcb0f586d00 () - no available tts models for out mnt
[W] 13:04:34.217295050.217 0x7fcb0f586d00 () - invalid task id


mkiol commented on June 12, 2024

> With Addon.Nvidia installed, if I try to transcribe an audio file with faster-whisper, it exits

Did it work in the previous version? In the new version 4.5.0, the CUDA runtime was updated from 12.2.2 to 12.4.0, but this is a very minor update and I don't think it could break anything. Does the problem also occur with "Whisper" models (not Faster Whisper)?


h9j6k commented on June 12, 2024

> I've tried a fresh Debian 12 + GNOME installation

You are awesome!!!

> Did you make any significant changes to the audio server on your system? How can I reproduce this problem?

Ah, in the first posted log, there was one line: [D] 07:41:09.766380424.766 0x7f86fb5c8d00 () - platform: "xcb"

I am not sure the xcb platform is the same as in the GNOME verbose info, since I am not using any of those heavy desktop environments. The GUI currently running on my workstation is the suckless tools combo dmenu+dwm+ssterm:

https://tools.suckless.org

Maybe that's the reason why the PA behind-the-scenes setting has not been picked up by dwm on my workstation the way it was by GNOME on yours?

> Does the problem also occur with "Whisper" models (not Faster Whisper)?

Before the 4.5.0 upgrade, on 4.4.0, faster-whisper models didn't work at all, either large v3 or medium.

On 4.5.0, though, I just noticed one interesting thing: if I use the faster-whisper medium model, it works just fine; only when switching to large v3 does it exit without saying anything useful in the verbose log.

As for whisper models, they always work, large or medium, on both 4.4.0 and 4.5.0.

Could this be an issue of not having enough RAM?? (I have only 16GB of ECC RAM.)

But it does not make much sense; faster-whisper models are supposed to need less RAM than whisper models...

Edit: grammar, on --> one, modes --> models


h9j6k commented on June 12, 2024

Thanks for the clarification on the whisper/faster-whisper/OpenAI Whisper models. Then would it be better to show whisper as "whisper.cpp" instead of just "whisper" in the dsnote menu?

Also, on the faster-whisper models, I found that on 4.5.0 the medium faster-whisper model does not actually work either. I noticed this line in the log:

[E] 02:46:00.019774119.19 0x7fa21f7fe680 operator():340 - fasterwhisper py error: RuntimeError: cuDNN failed with status CUDNN_STATUS_ALLOC_FAILED

I mistook working OpenCL transcription for CUDA yesterday; i.e., it was CPU transcription with faster-whisper that worked, while GPU transcription with faster-whisper didn't.

Could it be that the GPU does not have enough VRAM? My Quadro K2200 only has 4GB of VRAM onboard :(

Edit: I tried the faster-whisper small model; it works, and there is one line in the log:

[2024-05-20 12:02:24.533] [ctranslate2] [thread 9] [warning] The compute type inferred from the saved model is int8_float32, but the target device or backend do not support efficient int8_float32 computation. The model weights have been automatically converted to use the float32 compute type instead.

So I think the culprit is the GPU not having enough VRAM; maybe a CUDA GPU w/ at least 8GB VRAM, such as a Quadro T1000 etc., is needed for large v3 faster-whisper??

Edit2: there is a mention of CUDA compute capability 6.1 being required for INT8; mine supports only 5.0:

SYSTRAN/faster-whisper#42 (comment)
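
(For anyone checking their own card: recent nvidia-smi builds can report the compute capability directly; a quick check, assuming a driver new enough to support the compute_cap query field. The output below is illustrative, matching the K2200's 5.0 mentioned above:)

$ nvidia-smi --query-gpu=name,compute_cap --format=csv
name, compute_cap
Quadro K2200, 5.0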


mkiol commented on June 12, 2024

> Then would it be better to show whisper as "whisper.cpp" instead of just "whisper" in the dsnote menu?

Definitely. I have to rename them. The name is "Whisper" because initially only one "Whisper" engine, based on whisper.cpp, was implemented. Then I added "Faster Whisper" and this confusion was created.

> Also, on the faster-whisper models, I found that on 4.5.0 the medium faster-whisper model does not actually work either.
> RuntimeError: cuDNN failed with status CUDNN_STATUS_ALLOC_FAILED
> Could it be that the GPU does not have enough VRAM? My Quadro K2200 only has 4GB of VRAM onboard :(
>
> Edit: I tried the faster-whisper small model; it works, and there is one line in the log:
>
> [2024-05-20 12:02:24.533] [ctranslate2] [thread 9] [warning] The compute type inferred from the saved model is int8_float32, but the target device or backend do not support efficient int8_float32 computation. The model weights have been automatically converted to use the float32 compute type instead.
>
> So I think the culprit is the GPU not having enough VRAM; maybe a CUDA GPU w/ at least 8GB VRAM, such as a Quadro T1000 etc., is needed for large v3 faster-whisper??
>
> Edit2: there is a mention of CUDA compute capability 6.1 being required for INT8; mine supports only 5.0

This is very interesting. All "Faster Whisper" models in Speech Note are INT8 because they are most efficient on CPU. From my tests, on GPU, the difference between INT8 and F16 is minimal. Maybe your card is different.

Would you be able to test F16 models from this site: https://huggingface.co/guillaumekln/faster-whisper-medium.en? To add them to Speech Note, you just need to manually edit the ~/.var/app/net.mkiol.SpeechNote/data/net.mkiol/dsnote/models.json file and add the following entry:

        {
            "name": "English (FasterWhisper Medium F16)",
            "model_id": "en_fasterwhisper_medium_f16",
            "engine": "stt_fasterwhisper",
            "lang_id": "en",
            "checksum": "a9514015",
            "checksum_quick": "97c57278",
            "size": "1530465940",
            "comp": "dir",
            "urls": [
                "https://huggingface.co/guillaumekln/faster-whisper-medium.en/resolve/83a3b718775154682e5f775bc5d5fc961d2350ce/model.bin",
                "https://huggingface.co/guillaumekln/faster-whisper-medium.en/resolve/83a3b718775154682e5f775bc5d5fc961d2350ce/config.json",
                "https://huggingface.co/guillaumekln/faster-whisper-medium.en/resolve/83a3b718775154682e5f775bc5d5fc961d2350ce/tokenizer.json",
                "https://huggingface.co/guillaumekln/faster-whisper-medium.en/resolve/83a3b718775154682e5f775bc5d5fc961d2350ce/vocabulary.txt"
            ]
        }

After restarting, you should be able to download and test the "English (FasterWhisper Medium F16)" model.
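
(A cautious way to apply this, assuming the entry above is pasted inside the existing JSON array in models.json; flatpak kill here is just one way to force the restart:)

$ cd ~/.var/app/net.mkiol.SpeechNote/data/net.mkiol/dsnote
$ cp models.json models.json.bak      # keep a backup before editing
$ $EDITOR models.json                 # paste the entry above
$ flatpak kill net.mkiol.SpeechNote   # restart so the file is re-read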


h9j6k commented on June 12, 2024

All "Faster Whisper" models in Speech Note are INT8 because they are most efficient on CPU. From my tests, on GPU, difference between INT8 and F16 is minimal

My guess would be that on GPU it is the opposite of on CPU. From what I read, only high-end and more recent CUDA GPUs can speed up f16/f8/f4 computation (because they have tensor cores??), whereas most CUDA GPUs can do normal float32 (because the CUDA cores can handle it). Maybe the difference in your testing was minimal because the f16 was never actually exploited in the computation; e.g., with an RTX 4000 SFF Ada (too expensive!!!) or similar there could be a noticeable difference??

> Would you be able to test F16 models from this site: https://huggingface.co/guillaumekln/faster-whisper-medium.en

Unfortunately, transcription crashes with the medium f16 model too; I think the same reason applies: my GPU's 4GB of VRAM is not big enough (maybe it's time to look for another CUDA GPU w/ 8GB VRAM). The K2200 can only handle the small faster-whisper model, and when I used nvidia-smi I saw that even with the small model, dsnote cached 2.7GB of VRAM. Also, even when transcription is complete, it won't release the VRAM it cached. I think this is due to how the ctranslate2 cache works, which is quite different from whisper.cpp:

https://opennmt.net/CTranslate2/environment_variables.html

> CT2_CUDA_ALLOCATOR
> Allocating memory on the GPU with cudaMalloc is costly and is best avoided in high-performance code. For this reason CTranslate2 integrates caching allocators which enable a fast reuse of previously allocated buffers. The following allocators are integrated:
>
> cuda_malloc_async (default for CUDA >= 11.2)
> Uses the asynchronous allocator with memory pools introduced in CUDA 11.2.
>
> cub_caching (default for CUDA < 11.2)
> Uses the caching allocator from the CUB project.

When I use whisper.cpp, not only can it handle the large v3 model (with large v3 it uses about 2GB of VRAM), but when transcription is done it also releases the VRAM.
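
(Given the docs quoted above, it might be worth testing whether the allocator choice changes this caching behaviour; an untested sketch, assuming flatpak's --env flag propagates the variable through to the ctranslate2 runtime:)

$ flatpak run --env=CT2_CUDA_ALLOCATOR=cub_caching net.mkiol.SpeechNote --verbose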

At the moment, whisper.cpp large v3 and faster-whisper small are my go-to models (for accuracy and speed, respectively).


mkiol commented on June 12, 2024

> Unfortunately, transcription crashes with the medium f16 model too; I think the same reason applies: my GPU's 4GB of VRAM is not big enough
> I think this is due to how the ctranslate2 cache works, which is quite different from whisper.cpp.

Thanks for your analysis and tests. This is an additional reason why whisper.cpp may be a better choice than faster-whisper in general.

> At the moment, whisper.cpp large v3 and faster-whisper small are my go-to models (for accuracy and speed, respectively).

Cool :)

