mkiol / dsnote
Speech Note Linux app. Note taking, reading and translating with offline Speech to Text, Text to Speech and Machine translation.
License: Mozilla Public License 2.0
OK, the most basic voice uses almost no processing power, and the very best voices use loads of it; unless you have a very powerful computer, they lag and buffer a lot.
The text-to-speech option to save the audio to MP3 is great, but without enough processing power it can take 12 to 18 hours to do a high-rate conversion of a 400-page document to MP3.
So the voice names need a scale beside them. I figure that the small, medium and large designations MIGHT be linked to a data rate, but they might just be linked to the download file size...
For most of my work I have to read LARGE documents, 400 pages and up, and it's better to have them read out and saved as an MP3, so I can listen to them when driving long distances or when resting.
I don't need stereophonic high fidelity... low-resolution audio is fine...
I also lack computers that are much beyond office work and playing a few videos, so the down-scale options are needed: "Oh, voice X uses 200 times the resources of the eSpeak robot... brilliant, but I'll be happy with one that uses 25 times the processing power of the eSpeak robot."
I am REALLY impressed with what you all have done so far... It's incredible... I mean this is really good.
Hi,
thanks for this awesome app.
It is very useful for students and teachers, and especially for students with difficulties.
I would suggest adding a choice of reading speeds when text is read aloud.
Also, the ability to export audio to other formats such as MP3, Ogg, etc. would be welcome.
Thank you.
V/R,
A.
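Until export formats like these are built in, one workaround sketch is to convert whatever the app can save with ffmpeg. This assumes ffmpeg is installed and that the app has saved a WAV file; "note.wav" is a hypothetical filename.

```shell
# Convert a WAV file saved by the app to MP3 and Ogg Vorbis (VBR quality 4).
ffmpeg -i note.wav -codec:a libmp3lame -qscale:a 4 note.mp3
ffmpeg -i note.wav -codec:a libvorbis  -qscale:a 4 note.ogg
```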
Hello,
I really appreciate your project! I think it's going in a very nice and useful direction!
LiveCaptions uses aprilasr, which is very fast and only needs the CPU.
I think it would be great if you could add aprilasr as one of the speech recognition options in your project.
It would add a lot of value to your project by offering a fast and lightweight option for users who don't have access to GPUs or who want to conserve battery life on mobile devices.
Thanks in advance! Good luck with the rest of the project ;)
Flatpak is a great package format but has a few limitations. The major ones are as follows: it needs
--device=all
to start working.
Not-sandboxed package formats for consideration:
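For reference, device access for a Flatpak app can be granted per-user from the command line; a sketch, assuming the Flathub app ID net.mkiol.SpeechNote:

```shell
# Grant the sandboxed app access to all device nodes (e.g. the GPU) for the current user.
flatpak override --user --device=all net.mkiol.SpeechNote

# Verify that the override took effect.
flatpak override --user --show net.mkiol.SpeechNote
```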
Are there any upcoming plans to introduce a feature that enables DS Note to seamlessly insert dictated text into any selected text box, wherever the cursor is, similar to the functionality found in Windows where you can simply press Windows + H?
As someone with SEVERELY limited dexterity and mobility due to a disability, this function is crucial for me to get through a normal working day, and personally it is a big barrier to making a full-time switch from Windows to Linux, especially when I need to work. Unfortunately, I lack the programming skills, or the capacity to grasp anything more complex than a basic "Hello, World!" program. I'm curious to know whether such a feature is feasible within the DS Note program.
But for what it's worth, right now just having something on Flathub that offers something similar is a game changer.
Selecting GPU to transcribe an audio file is causing a crash
QIBusPlatformInputContext: invalid portal bus.
QSocketNotifier: Can only be used with threads started with QThread
qt.qpa.qgnomeplatform: Could not find color scheme ""
whisper_init_from_file_no_state: loading model from '/home/user/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote/speech-models/en_whisper_small.ggml'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 768
whisper_model_load: n_text_head = 12
whisper_model_load: n_text_layer = 12
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 9
whisper_model_load: qntvr = 2
whisper_model_load: type = 3
whisper_model_load: mem required = 459.00 MB (+ 16.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 180.95 MB
ggml_opencl: selecting platform: 'Clover'
ggml_opencl: selecting device: 'AMD Radeon RX 6800M (navi22, LLVM 15.0.7, DRM 3.54, 6.5.5-1-linux)'
ggml_opencl: device FP16 support: false
ggml_opencl: kernel compile error:
fatal error: cannot open file '/usr/lib/x86_64-linux-gnu/GL/default/share/clc/gfx1031-amdgcn-mesa-mesa3d.bc': No such file or directory
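The missing .bc file is a libclc bitcode library that Mesa's OpenCL stack compiles kernels against. As an assumption-laden sketch (paths vary by distribution and Flatpak runtime), you can check whether the bitcode for this GPU exists anywhere at all:

```shell
# Look for the libclc bitcode library Mesa's OpenCL (Clover) is asking for.
# If nothing is found, a libclc package matching your Mesa/LLVM version is likely missing.
find /usr -name 'gfx1031*.bc' 2>/dev/null
find /var/lib/flatpak -name 'gfx1031*.bc' 2>/dev/null
```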
Hello,
the new version of your program is really nice. Is it possible to send you money for your hard work?
I looked at the GitHub page but I don't see any button.
:)
I'm not sure why, but it does seem to be related to Flatpak.
On system:
NVIDIA-SMI 535.113.01 Driver Version: 535.113.01 CUDA Version: 12.2
NVIDIA GeForce 940M (2GB VRAM) (Should be enough to run the small whisper)
On flatpak:
nvidia-535-104-05 org.freedesktop.Platform.GL.nvidia-535-104-05 1.4 user
nvidia-535-113-01 org.freedesktop.Platform.GL.nvidia-535-113-01 1.4 user
nvidia-535-98 org.freedesktop.Platform.GL.nvidia-535-98 1.4 user
nvidia-535-104-05 org.freedesktop.Platform.GL32.nvidia-535-104-05 1.4 user
nvidia-535-113-01 org.freedesktop.Platform.GL32.nvidia-535-113-01 1.4 user
nvidia-535-98 org.freedesktop.Platform.GL32.nvidia-535-98 1.4 user
Logs
[D] 14:13:45.593 0x7f5d825ff600 process_buff:226 - vad: no speech
[D] 14:13:45.593 0x7f5d825ff600 set_processing_state:430 - processing state: idle => decoding
[D] 14:13:45.593 0x7f5d825ff600 set_speech_detection_status:508 - speech detection status: speech-detected => decoding (no-speech)
[D] 14:13:45.593 0x7f5d825ff600 () - service refresh status, new state: listening-single-sentence
[D] 14:13:45.593 0x7f5d825ff600 () - task state changed: 1 => 2
[D] 14:13:45.593 0x7f5d825ff600 process_buff:284 - speech frame: samples=51360
[D] 14:13:45.593 0x7f5d825ff600 decode_speech:350 - speech decoding started
[D] 14:13:45.597 0x7f5de77bbd80 () - app task state: speech-detected => processing
CUDA error 209 at /run/build/whispercpp-cublas/ggml-cuda.cu:6102: no kernel image is available for execution on the device
[W] 14:13:46.168 0x7f5d825ff600 () - QObject::killTimer: Timers cannot be stopped from another thread
[W] 14:13:46.169 0x7f5d825ff600 () - QObject::~QObject: Timers cannot be stopped from another thread
[D] 14:13:46.178 0x7f5d825ff600 () - speech service dtor
[W] 14:13:46.179 0x7f5d825ff600 () - QtDBus: cannot relay signals from parent speech_service(0x5647aeab6ea0 "") unless they are emitted in the object's thread QThread(0x5647af143ed0 ""). Current thread is QThread(0x7f5d5c0016e0 "").
[D] 14:13:46.179 0x7f5d825ff600 () - mic source dtor
[W] 14:13:46.179 0x7f5d825ff600 () - QObject::killTimer: Timers cannot be stopped from another thread
Why is the Flatpak app so big?
Does it have Whisper and the other engines included?
Wouldn't it be better to move those to download mode, just like the language data?
Hello,
thank you for this amazing program! It would be nice if you could add a pause button for TTS.
Have a nice day.
Also, we could add large language models to the application, starting with smaller models and adding bigger ones over time.
This could be really helpful, as open-source LLMs that aim to replace ChatGPT are getting more powerful every day.
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
What are your thoughts on this?
Is this feasible?
If you already have a language and a language model in place, and your only interest is in changing the language model, it is confusing to have to select a language before seeing the alternative language models. Some explanation would be nice.
I would love it if there were a way to create audio files via the command line, for a bit more automation.
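Until the app itself exposes a CLI, a rough workaround sketch for command-line text-to-speech, assuming espeak-ng and lame are installed (the filenames are placeholders):

```shell
# Read a text file aloud with espeak-ng, stream WAV to stdout, encode to MP3 with lame.
espeak-ng -f input.txt --stdout | lame - output.mp3
```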
Is drag-and-drop support for .mp3 files a possibility? Having to choose File → Transcribe a file, selecting a directory, and changing the filter from audio to all files for the .mp3 to show up is tedious. A bonus would be for the name of the audio file to auto-populate the text save dialog box. Maybe it could be fixed with Flatseal, but I am not sure how.
Side note:
Using the Whisper model gives great results. I can confirm that enabling GPU support in the settings does work, as I see the GPU memory and usage spike while transcription is occurring, using Mint 21.2 and an Nvidia RTX 3050.
I would love to make a monetary contribution, but I am unable to find a link, unless I overlooked it.
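On the Flatseal point: file-picking friction like this usually comes from the Flatpak filesystem sandbox. A hedged sketch of widening it from the command line (this is what Flatseal's filesystem toggles do under the hood):

```shell
# Allow the sandboxed app to read files anywhere in the home directory.
flatpak override --user --filesystem=home net.mkiol.SpeechNote
```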
The challenge is that, as of August 23, 2023, dsnote does not support GNOME Wayland. This is a problem because most recent versions of Linux distributions now use GNOME Wayland by default, not GNOME X11; this includes distributions such as, but not limited to, Debian, Fedora, Manjaro, Red Hat Enterprise Linux, and Ubuntu.
The suggested resolution is to configure your Flatpak package appropriately so that it supports Wayland, with the end result that both GNOME Wayland and X11 are supported. If you're interested in this, the documentation about the Flatpak sandbox might be useful; if that documentation is somehow not available, this archived page might be of interest. Alternatively, Flatpak support for maintainers is available here.
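As a sketch of the kind of configuration meant here: Wayland support in a Flatpak is typically enabled by granting the Wayland socket, either in the app manifest's finish-args or per-user on an installed app:

```shell
# Let the app talk to the Wayland compositor directly.
flatpak override --user --socket=wayland net.mkiol.SpeechNote
```

In a manifest, the equivalent finish-args are usually `--socket=wayland` together with `--socket=fallback-x11`, so X11 sessions keep working.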
Below is the same as above. But with details if you're interested in those.
QSocketNotifier: Can only be used with threads started with QThread
qt.qpa.wayland: Creating a fake screen in order for Qt not to crash
qt.qpa.qgnomeplatform: Could not find color scheme ""
Completed
https://flathub.org/fr/apps/net.mkiol.SpeechNote
If needed, both the Ubertus.org team and I would be happy to contribute beta testing and documentation for this improvement or new feature. Any volunteer for a patch?
How it looks when it hangs.
If I first move the file to Downloads and then select it, it will start transcribing.
Sorry for how long this is, I don't really know what's useful here...
[chrisshaw@chris-fedora ~]$ flatpak run net.mkiol.SpeechNote --verbose
QSocketNotifier: Can only be used with threads started with QThread
qt.qpa.qgnomeplatform: Could not find color scheme ""
[I] 13:28:20.174 0x7f658be10d80 init:49 - logging to stderr enabled
[D] 13:28:20.174 0x7f658be10d80 () - translation: "en_US"
[W] 13:28:20.174 0x7f658be10d80 () - failed to install translation
[D] 13:28:20.174 0x7f658be10d80 () - starting standalone app
[D] 13:28:20.175 0x7f658be10d80 () - app: net.mkiol dsnote
[D] 13:28:20.175 0x7f658be10d80 () - config location: "/home/chrisshaw/.var/app/net.mkiol.SpeechNote/config"
[D] 13:28:20.175 0x7f658be10d80 () - data location: "/home/chrisshaw/.var/app/net.mkiol.SpeechNote/data/net.mkiol/dsnote"
[D] 13:28:20.175 0x7f658be10d80 () - cache location: "/home/chrisshaw/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote"
[D] 13:28:20.175 0x7f658be10d80 () - settings file: "/home/chrisshaw/.var/app/net.mkiol.SpeechNote/config/net.mkiol/dsnote/settings.conf"
[D] 13:28:20.176 0x7f658be10d80 () - available styles: ("Default", "Fusion", "Imagine", "Material", "org.kde.breeze", "org.kde.desktop", "Plasma", "Universal")
[D] 13:28:20.176 0x7f658be10d80 () - style paths: ("/usr/lib/qml/QtQuick/Controls.2")
[D] 13:28:20.176 0x7f658be10d80 () - switching to style: "org.kde.desktop"
[D] 13:28:20.343 0x7f658be10d80 () - supported audio input devices:
ALSA lib ../../oss/pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
[D] 13:28:20.359 0x7f658be10d80 () - "pulse"
[D] 13:28:20.427 0x7f658be10d80 () - "upmix"
[D] 13:28:20.588 0x7f658be10d80 () - "default"
ALSA lib ../../../src/pcm/pcm_direct.c:2045:(snd1_pcm_direct_parse_open_conf) The field ipc_gid must be a valid group (create group audio)
ALSA lib ../../../src/pcm/pcm_direct.c:2045:(snd1_pcm_direct_parse_open_conf) The field ipc_gid must be a valid group (create group audio)
[D] 13:28:20.598 0x7f658be10d80 () - "alsa_input.usb-046d_HD_Pro_Webcam_C920_2AE889FF-02.analog-stereo"
[D] 13:28:20.598 0x7f658be10d80 () - "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor"
[D] 13:28:20.598 0x7f658be10d80 () - "alsa_input.pci-0000_00_1f.3.analog-stereo"
[D] 13:28:20.598 0x7f658be10d80 add_cuda_devices:226 - scanning for cuda devices
[D] 13:28:20.601 0x7f658be10d80 add_cuda_devices:235 - cuda version: driver=0, runtime=12020
[D] 13:28:20.601 0x7f658be10d80 add_cuda_devices:240 - cudaGetDeviceCount returned: 35
[D] 13:28:20.601 0x7f658be10d80 add_hip_devices:263 - scanning for hip devices
[D] 13:28:20.601 0x7f658be10d80 hip_api:170 - failed to open hip lib: libamdhip64.so: cannot open shared object file: No such file or directory
[D] 13:28:20.601 0x7f658be10d80 add_opencl_devices:300 - scanning for opencl devices
[D] 13:28:20.812 0x7f658be10d80 add_opencl_devices:317 - opencl number of platforms: 2
[D] 13:28:20.812 0x7f658be10d80 add_opencl_devices:342 - opencl platform: 0, name=Clover, vendor=Mesa
[D] 13:28:20.812 0x7f658be10d80 add_opencl_devices:356 - opencl number of devices: 0
[D] 13:28:20.812 0x7f658be10d80 add_opencl_devices:342 - opencl platform: 1, name=AMD Accelerated Parallel Processing, vendor=Advanced Micro Devices, Inc.
[D] 13:28:20.812 0x7f658be10d80 add_opencl_devices:356 - opencl number of devices: 0
[D] 13:28:20.815 0x7f6563fff600 loop:58 - py executor loop started
[D] 13:28:20.851 0x7f658be10d80 () - starting service: app-standalone
[D] 13:28:20.858 0x7f65621fe600 () - config version: 34 34
[D] 13:28:20.860 0x7f65621fe600 () - checksum ok: "6571cb18" "en_whisper_base.ggml"
[D] 13:28:20.860 0x7f65621fe600 () - found model: "en_whisper_base"
[D] 13:28:20.863 0x7f65621fe600 () - found model: "am_espeak_am"
[D] 13:28:20.863 0x7f65621fe600 () - found model: "ar_espeak_ar"
[D] 13:28:20.863 0x7f65621fe600 () - found model: "bg_espeak_bg"
[D] 13:28:20.863 0x7f65621fe600 () - found model: "bs_espeak_bs"
[D] 13:28:20.863 0x7f65621fe600 () - found model: "ca_espeak_ca"
[D] 13:28:20.863 0x7f65621fe600 () - found model: "cs_espeak_cs"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "da_espeak_da"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "de_espeak_de"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "el_espeak_el"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "en_espeak_en"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "eo_espeak_eo"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "es_espeak_es"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "et_espeak_et"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "eu_espeak_eu"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "is_espeak_is"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "fa_espeak_fa"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "fi_espeak_fi"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "fr_espeak_fr"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "hi_espeak_hi"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "hr_espeak_hr"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "hu_espeak_hu"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "id_espeak_id"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "it_espeak_it"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ja_espeak_ja"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "kk_espeak_kk"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ko_espeak_ko"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "lv_espeak_lv"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "lt_espeak_lt"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "mk_espeak_mk"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ms_espeak_ms"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ne_espeak_ne"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "nl_espeak_nl"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "no_espeak_no"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "pt_espeak_pt"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "pt_espeak_pt_br"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ro_espeak_ro"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ru_espeak_ru"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "sk_espeak_sk"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "sl_espeak_sl"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "sr_espeak_sr"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "sv_espeak_sv"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "sw_espeak_sw"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "th_espeak_th"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "tr_espeak_tr"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "uk_espeak_uk"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ka_espeak_ka"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "ky_espeak_ky"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "la_espeak_la"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "tt_espeak_tt"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "sq_espeak_sq"
[D] 13:28:20.864 0x7f65621fe600 () - found model: "uz_espeak_uz"
[D] 13:28:20.864 0x7f658be10d80 () - module already unpacked: "rhvoicedata"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "vi_espeak_vi"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "zh_espeak_yue"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "zh_espeak_hak"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "zh_espeak_cmn"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "ga_espeak_ga"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "mt_espeak_mt"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "bn_espeak_bn"
[D] 13:28:20.865 0x7f65621fe600 () - found model: "pl_espeak_pl"
[D] 13:28:20.865 0x7f658be10d80 () - module already unpacked: "rhvoiceconfig"
[D] 13:28:20.868 0x7f65621fe600 () - models changed
[D] 13:28:20.876 0x7f658be10d80 () - module already unpacked: "espeakdata"
[D] 13:28:20.877 0x7f658be10d80 () - default tts model not found: "en"
[D] 13:28:20.877 0x7f658be10d80 () - default mnt lang not found: "en"
[D] 13:28:20.877 0x7f658be10d80 () - new default mnt lang: "en"
[D] 13:28:20.877 0x7f658be10d80 () - service refresh status, new state: idle
[D] 13:28:20.877 0x7f658be10d80 () - service state changed: unknown => idle
[D] 13:28:21.115 0x7f658be10d80 () - starting app: app-standalone
[D] 13:28:21.115 0x7f658be10d80 () - app service state: unknown => idle
[D] 13:28:21.115 0x7f658be10d80 () - app stt available models: 0 => 1
[D] 13:28:21.115 0x7f658be10d80 () - update listen
[D] 13:28:21.115 0x7f658be10d80 () - app active stt model: "" => "en_whisper_base"
[D] 13:28:21.115 0x7f658be10d80 () - update listen
[W] 13:28:21.116 0x7f658be10d80 () - no available mnt langs
[W] 13:28:21.116 0x7f658be10d80 () - no available mnt out langs
[W] 13:28:21.116 0x7f658be10d80 () - no available tts models for in mnt
[W] 13:28:21.116 0x7f658be10d80 () - no available tts models for out mnt
[W] 13:28:21.116 0x7f658be10d80 () - invalid task, reseting task state
[D] 13:28:21.116 0x7f658be10d80 () - app stt configured: false => true
logger error: invalid format string
qrc:/qml/main.qml:165:5: QML Connections: Implicitly defined onFoo properties in Connections are deprecated. Use this syntax instead: function onFoo(<arguments>) { ... }
logger error: invalid format string
qrc:/qml/main.qml:156:5: QML Connections: Implicitly defined onFoo properties in Connections are deprecated. Use this syntax instead: function onFoo(<arguments>) { ... }
logger error: invalid format string
qrc:/qml/Notepad.qml:24:5: QML Connections: Implicitly defined onFoo properties in Connections are deprecated. Use this syntax instead: function onFoo(<arguments>) { ... }
logger error: invalid format string
qrc:/qml/Translator.qml:29:5: QML Connections: Implicitly defined onFoo properties in Connections are deprecated. Use this syntax instead: function onFoo(<arguments>) { ... }
[D] 13:28:21.309 0x7f658be10d80 onCompleted:85 - default font pixel size: 14
[D] 13:28:21.328 0x7f658be10d80 () - default tts model not found: "en"
[D] 13:28:21.328 0x7f658be10d80 () - default mnt lang not found: "en"
[D] 13:28:21.328 0x7f658be10d80 () - new default mnt lang: "en"
[D] 13:28:21.328 0x7f658be10d80 () - service refresh status, new state: idle
[D] 13:28:21.328 0x7f658be10d80 () - service refresh status, new state: idle
[W] 13:28:21.380 0x7f658be10d80 ():164 - qrc:/qml/Translator.qml:164:9: QML ColumnLayout (parent or ancestor of QQuickLayoutAttached): Binding loop detected for property "preferredWidth"
[D] 13:28:21.524 0x7f658be10d80 () - stt models changed
[D] 13:28:21.525 0x7f658be10d80 () - update listen
[D] 13:28:21.525 0x7f658be10d80 () - tts models changed
[D] 13:28:21.525 0x7f658be10d80 () - update listen
[W] 13:28:21.525 0x7f658be10d80 () - no available tts models for in mnt
[W] 13:28:21.525 0x7f658be10d80 () - no available tts models for out mnt
[D] 13:28:21.525 0x7f658be10d80 () - ttt models changed
[D] 13:28:21.526 0x7f658be10d80 () - mnt langs changed
[D] 13:28:21.526 0x7f658be10d80 () - update listen
[W] 13:28:21.526 0x7f658be10d80 () - no available mnt langs
[W] 13:28:21.526 0x7f658be10d80 () - no available mnt out langs
[D] 13:28:35.806 0x7f658be10d80 () - default tts model not found: "en"
[D] 13:28:35.807 0x7f658be10d80 () - default mnt lang not found: "en"
[D] 13:28:35.807 0x7f658be10d80 () - new default mnt lang: "en"
[D] 13:28:35.807 0x7f658be10d80 () - choosing model for id: "en_whisper_base" "en"
[D] 13:28:35.807 0x7f658be10d80 () - restart stt engine config: "lang=en, model-files=[model-file=/home/chrisshaw/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote/speech-models/en_whisper_base.ggml, scorer-file=, ttt-model-file=], speech-mode=automatic, vad-mode=aggressiveness-3, speech-started=0, use-gpu=0, gpu-device=[id=-1, api=opencl, name=, platform-name=]"
[D] 13:28:35.807 0x7f658be10d80 () - new stt engine required
[D] 13:28:35.808 0x7f658be10d80 open_whisper_lib:109 - using whisper-openblas
[D] 13:28:37.109 0x7f658be10d80 make_wparams:340 - cpu info: arch=x86_64, cores=4
[D] 13:28:37.110 0x7f658be10d80 make_wparams:342 - using threads: 4/4
[D] 13:28:37.110 0x7f658be10d80 make_wparams:344 - system info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 |
[D] 13:28:37.110 0x7f658be10d80 start:199 - starting engine
[D] 13:28:37.110 0x7f658be10d80 start:207 - engine started
[D] 13:28:37.110 0x7f658be10d80 () - creating audio source
[D] 13:28:37.110 0x7f658be10d80 () - mic source created
[D] 13:28:37.110 0x7f64fbc15600 start_processing:244 - processing started
[D] 13:28:37.110 0x7f64fbc15600 set_processing_state:430 - processing state: idle => initializing
[D] 13:28:37.110 0x7f64fbc15600 set_processing_state:437 - speech detection status: no-speech => initializing (no-speech)
[D] 13:28:37.110 0x7f64fbc15600 () - service refresh status, new state: idle
[D] 13:28:37.110 0x7f64fbc15600 () - task state changed: 0 => 3
[D] 13:28:37.110 0x7f64fbc15600 create_whisper_model:175 - creating whisper model
whisper_init_from_file_no_state: loading model from '/home/chrisshaw/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote/speech-models/en_whisper_base.ggml'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 9
whisper_model_load: qntvr = 2
whisper_model_load: type = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 56.51 MB
[D] 13:28:37.340 0x7f658be10d80 () - using audio input: "alsa_input.usb-046d_HD_Pro_Webcam_C920_2AE889FF-02.analog-stereo"
whisper_model_load: model size = 56.38 MB
whisper_init_state: kv self size = 5.25 MB
whisper_init_state: kv cross size = 17.58 MB
whisper_init_state: compute buffer (conv) = 14.10 MB
whisper_init_state: compute buffer (encode) = 81.85 MB
whisper_init_state: compute buffer (cross) = 4.40 MB
whisper_init_state: compute buffer (decode) = 24.61 MB
[D] 13:28:37.440 0x7f64fbc15600 create_whisper_model:185 - whisper model created
[D] 13:28:37.440 0x7f64fbc15600 set_processing_state:430 - processing state: initializing => idle
[D] 13:28:37.440 0x7f64fbc15600 set_processing_state:437 - speech detection status: initializing => no-speech (no-speech)
[D] 13:28:37.440 0x7f64fbc15600 () - service refresh status, new state: idle
[D] 13:28:37.440 0x7f64fbc15600 () - task state changed: 3 => 0
[D] 13:28:37.657 0x7f658be10d80 () - audio state: IdleState
[D] 13:28:37.658 0x7f658be10d80 () - service refresh status, new state: listening-auto
[D] 13:28:37.658 0x7f658be10d80 () - service state changed: idle => listening-auto
[W] 13:28:37.660 0x7f658be10d80 () - ignore TaskStatePropertyChanged signal
[W] 13:28:37.660 0x7f658be10d80 () - ignore TaskStatePropertyChanged signal
[D] 13:28:37.660 0x7f658be10d80 () - app current task: -1 => 0
[W] 13:28:37.660 0x7f658be10d80 () - invalid task, reseting task state
[D] 13:28:37.660 0x7f658be10d80 () - app service state: idle => listening-auto
[W] 13:28:37.664 0x7f658be10d80 () - no available mnt langs
[W] 13:28:37.664 0x7f658be10d80 () - no available mnt out langs
[W] 13:28:37.664 0x7f658be10d80 () - no available tts models for in mnt
[W] 13:28:37.664 0x7f658be10d80 () - no available tts models for out mnt
[W] 13:28:37.664 0x7f658be10d80 () - invalid task, reseting task state
[D] 13:28:37.847 0x7f658be10d80 () - audio state: ActiveState
[D] 13:28:39.178 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=true, eof=false
[D] 13:28:39.210 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:40.762 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:40.795 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:42.162 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:42.194 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:43.561 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:43.597 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:45.162 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:45.201 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:46.561 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
** (dsnote:2): WARNING **: 13:28:46.596: atk-bridge: get_device_events_reply: unknown signature
[D] 13:28:46.600 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:48.162 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:48.202 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:49.762 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:49.800 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:51.162 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:51.200 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:52.762 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:52.797 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:54.162 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:54.175 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:55.561 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:55.593 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:57.162 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:57.184 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:28:58.762 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:28:58.774 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:29:00.164 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:29:00.181 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:29:01.762 0x7f64fbc15600 process_buff:195 - process samples buf: mode=automatic, in-buf size=24000, speech-buf size=0, sof=false, eof=false
[D] 13:29:01.798 0x7f64fbc15600 process_buff:226 - vad: no speech
[D] 13:29:02.215 0x7f658be10d80 () - cancel
[D] 13:29:02.215 0x7f658be10d80 () - stop stt engine
[D] 13:29:02.215 0x7f658be10d80 stop:225 - stop requested
[D] 13:29:02.215 0x7f658be10d80 stop_processing_impl:166 - whisper cancel
[D] 13:29:02.215 0x7f64fbc15600 flush:446 - flush: exit
[D] 13:29:02.215 0x7f64fbc15600 reset_in_processing:356 - reset in processing
[D] 13:29:02.215 0x7f64fbc15600 start_processing:279 - processing ended
[D] 13:29:02.215 0x7f658be10d80 stop:240 - stop completed
[D] 13:29:02.215 0x7f658be10d80 () - mic source dtor
[D] 13:29:02.215 0x7f658be10d80 () - audio state: SuspendedState
[D] 13:29:02.215 0x7f658be10d80 () - audio ended
[D] 13:29:02.217 0x7f658be10d80 () - service refresh status, new state: idle
[D] 13:29:02.217 0x7f658be10d80 () - service state changed: listening-auto => idle
[D] 13:29:02.217 0x7f658be10d80 () - service refresh status, new state: idle
[D] 13:29:02.217 0x7f658be10d80 () - app current task: 0 => -1
[W] 13:29:02.217 0x7f658be10d80 () - invalid task, reseting task state
[D] 13:29:02.217 0x7f658be10d80 () - app service state: listening-auto => idle
[W] 13:29:02.221 0x7f658be10d80 () - no available mnt langs
[W] 13:29:02.221 0x7f658be10d80 () - no available mnt out langs
[W] 13:29:02.221 0x7f658be10d80 () - no available tts models for in mnt
[W] 13:29:02.221 0x7f658be10d80 () - no available tts models for out mnt
[W] 13:29:02.221 0x7f658be10d80 () - invalid task, reseting task state
Hello,
I really appreciate your project! I think it's going in a very nice and useful direction!
I note that you support the Coqui STT, Vosk and whisper.cpp engines.
Would it be possible to add guillaumekln's faster-whisper STT engine? (Here)
faster-whisper has the advantage of being considerably faster than whisper.cpp, while consuming relatively little extra RAM (the differences are shown in a table on its GitHub).
So I think it would be a great idea! The models have, if I've understood correctly, been converted, but they are available on Hugging Face (again, everything is very well documented on its GitHub).
Thanks in advance! Good luck with the rest of the project ;)
Breizhux
I have installed and run SpeechNote from flatpak. It starts up fine, but as soon as I press Listen, it loads the speech model and crashes.
$ flatpak run net.mkiol.SpeechNote
Gtk-Message: 13:44:53.142: Failed to load module "xapp-gtk3-module"
Qt: Session management error: Authentication Rejected, reason : None of the authentication protocols specified are supported and host-based authentication failed
I select the Speech to text model and press Listen
whisper_init_from_file_no_state: loading model from '/home/user/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote/speech-models/multilang_whisper_base.ggml'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 218,00 MB (+ 6,00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 140,60 MB
And SpeechNote crashes.
The same situation occurs for each selected speech model.
(Linux Mint 21.1, Xfce 4.16)
hello,
enter for reading
p for pause
and more if you like.
Maybe you can add some options in the settings to configure all hotkeys.
:)
Speech Note is excellent software that can solve a lot of my tasks. A small improvement proposal from me would be the implementation of a spell check (e.g., Hunspell, Aspell) in the notepad. This would be very useful, for example, if you want to have text translated and make sure there are no unnecessary errors before translation due to small typos. Probably the best solution would be an integration of grammar checks via LanguageTool (remote API or local server).
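If LanguageTool integration were attempted, its documented HTTP API exposes a POST `/v2/check` endpoint that takes `text` and `language` form fields. This is only a sketch of building such a request; the server address and port are assumptions for a local LanguageTool install, not anything Speech Note currently ships:

```python
from urllib.parse import urlencode

def build_languagetool_check(text, language="en-US",
                             server="http://localhost:8081"):
    """Build the URL and form body for LanguageTool's /v2/check endpoint.

    The server address/port are placeholders for a hypothetical local
    LanguageTool server; the /v2/check endpoint and its 'text'/'language'
    fields come from LanguageTool's public HTTP API documentation.
    """
    url = f"{server}/v2/check"
    body = urlencode({"text": text, "language": language})
    return url, body
```

The body would be POSTed as `application/x-www-form-urlencoded`; the server replies with JSON containing a `matches` array describing each detected issue.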
Flatpak
4.1.0
For now I tested only the following engines:
eSpeak and Piper work for every text so far. Coqui and RHVoice can't read text if there's at least one new line.
The cause is probably that an empty task is created for the newline.
[D] 20:14:48.26 0x7fc8df77ed80 encode_speech:174 - task: SENTENCE_BEFORE_NEW_LINE
[D] 20:14:48.26 0x7fc8df77ed80 encode_speech:174 - task:
[D] 20:14:48.26 0x7fc8df77ed80 encode_speech:174 - task: SENTENCE_AFTER_NEW_LINE
[E] 20:14:59.438 0x7fc8d09ff600 operator():260 - py error: ValueError: You need to define either `text` (for sythesis) or a `reference_wav` (for voice conversion) to use the Coqui TTS API.
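The empty task in the middle of the log above supports that theory: the splitter seems to emit a segment for the blank line between sentences. A hypothetical sketch of the guard (the function name and splitting rule are my assumptions, not the app's actual code):

```python
def split_into_tasks(text):
    # Split on newlines and drop blank segments so engines such as
    # Coqui or RHVoice never receive an empty synthesis request.
    return [seg.strip() for seg in text.splitlines() if seg.strip()]
```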
In the latest version, it is not possible to select video files to transcribe their audio. Additionally, these formats are not supported:
I must first say that this project is amazing, really a game changer for me since I don't need to fiddle with conda environments in terminals to get different models working.
I am right now trying to transcribe a book with about 700 pages, since there is no audio book version, and especially the Piper Joe Medium model sounded amazing.
But it just doesn't save. It does, though, if I cut it into smaller chunks. I tried WAV and Opus, thinking compression might have broken it, but nothing seems to make it save. It outputs an initialization error: "Error: text to speech initialization engine has failed"
Also, it refuses to initialize TTS again afterwards, and the app needs a restart.
I am on a Fedora Linux 38 system. I'm using the latest version of Speech Note.
Here are the outputs from the terminal upon trying to save the WAV file:
Same colorful text all over till the very end.
Interestingly, Vorbis had the same pattern, but something different at the very end:
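Until the long-text save is fixed, the manual-chunking workaround mentioned above can be scripted. This is only a sketch of that idea; the chunk size and the sentence-splitting heuristic are assumptions, not limits taken from Speech Note:

```python
import re

def chunk_text(text, max_chars=4000):
    """Pack sentences into chunks under max_chars so each chunk can be
    fed to TTS and saved as a separate audio file."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when appending would exceed the limit.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

The resulting audio files could then be concatenated with any external tool.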
Thanks for this app!
I found the following issues while exploring the automation tools provided via the beta flatpak.
First, invoking any of the reading actions (`start-reading`, `start-reading-clipboard`, or `pause-resume-reading`) through the `--action` cmdline option will not work; the program just prints:
Invalid action. Use one option from the following: start-listening, start-listening-active-window, start-listening-clipboard, stop-listening, start-reading, start-reading-clipboard, pause-resume-reading, cancel.
Second, I didn't have any problem using the D-Bus `org.freedesktop.Application` interface; calling `ActivateAction` works perfectly fine. But I could not find what is defined in `dbus/org.mkiol.Speech.xml` on the D-Bus session; it seems that powerful interface isn't exposed at all. Is this normal?
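For reference, the `org.freedesktop.Application` path that does work can be driven from a script. This is a hedged sketch that merely assembles a `gdbus` command line; the object-path derivation follows the freedesktop D-Bus activation convention, and the exact exposure by SpeechNote is an assumption:

```python
def activate_action_cmd(app_id, action):
    """Build a gdbus invocation of org.freedesktop.Application.ActivateAction.

    Per the freedesktop D-Bus activation convention, the object path is
    derived from the application id: '.' becomes '/', '-' becomes '_',
    with a leading '/'.
    """
    object_path = "/" + app_id.replace(".", "/").replace("-", "_")
    return [
        "gdbus", "call", "--session",
        "--dest", app_id,
        "--object-path", object_path,
        "--method", "org.freedesktop.Application.ActivateAction",
        action, "[]", "{}",  # action name, empty parameter array, empty platform data
    ]
```

For example, `activate_action_cmd("net.mkiol.SpeechNote", "start-reading")` could be passed to `subprocess.run` to trigger reading.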
Some parts of Speech Note are not well translated.
I would like to help. I forked the git repo. Can I just use git, or do you have another tool for translation?
I just installed the software through Flathub and it does not produce a main menu icon in my start menu; I am using Zorin OS 16. I can get it running through the command line, but there is no entry in my start menu.
As Whisper is now supported (great stuff, thank you), it would be really cool if one could tick a box and use Whisper's ability to translate to English. It would be really handy when going abroad to be able to just record people speaking the local language and get an instant translation.
Speech Note enables you to take and read notes with your voice with multiple languages. It uses Speech to Text and Text to Speech conversions to do so. All voice processing is entirely done off-line, locally on your computer without the use of network connection. Your privacy is always respected. No data is send to the Internet.
Suggested corrections: "without the use of a network connection" and "No data is sent to the Internet."
I am impressed with Piper Ryan high, and the Piper Lessac high voice is very human sounding. And I am impressed that this little program makes all the voices easily accessible.
In Speech Note's settings, when changing the directory where the Deep Speech models are stored, the harbour-dsnote.service isn't restarted and keeps looking at the old (wrong) path.
Context: the model directory was changed to a path under /home.
Current work-around:
systemctl --user restart harbour-dsnote.service
Request: restart harbour-dsnote.service automatically when the model directory is changed.
I find this is a brilliant app for Linux. But, is there some way to fix how certain names or words are pronounced?
I notice the only way I can find your git repo is via the Flathub package.
Would you be willing to change the name to "SpeechNote"? It's better for SEO, as your tool will likely rank higher in search and more people will find it, I think.
Let me know what you think.
Hello,
I got the "Speech Note" Flatpak working on my Debian 12 system (ZBook Studio G5). I can use Whisper in offline mode here. After downloading Whisper (large and/or medium), the speech recognition is quite good, but very slow (50 sec.). GPU acceleration would help, so I installed the Nvidia drivers for my P1000. They work just fine with games, for example, but not with "Speech Note" and Whisper. Any ideas how to fix this? How do I get my Nvidia card to accelerate the speech recognition of Whisper on Debian 12? Maybe this is a bug?
My Nvidia Driver Version: 525.125.06
I already have libcudart11.0 and nvidia-cuda-toolkit installed.
I tried both Wayland and X11.
My card, the P1000, seems to support CUDA compute capability 6.1 - should this be enough?
Terminal output, when starting Speech Note:
flatpak run net.mkiol.SpeechNote
QSocketNotifier: Can only be used with threads started with QThread
qt.qpa.qgnomeplatform: Could not find color scheme ""
ALSA lib ../../oss/pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib ../../../src/pcm/pcm_direct.c:2045:(snd1_pcm_direct_parse_open_conf) The field ipc_gid must be a valid group (create group audio)
Some screens:
![Bildschirmfoto vom 2023-10-16 19-43-06](https://github.com/mkiol/dsnote/assets/148144728/9fd7c5af-15b6-405c-bb53-d69e603fda99)
![Bildschirmfoto vom 2023-10-16 19-42-36](https://github.com/mkiol/dsnote/assets/148144728/a79775e9-9f99-43bf-b80c-36a9ed15a3a4)
Whishper allows transcribing from any URL supported by yt-dlp; it would be very nice to have this feature available in this desktop app.
Measurements for a similar voice sample:

| platform | libstt 1.1 | libstt 1.4 |
|---|---|---|
| x86_64 (AMD Ryzen 7 3700X 8-Core) | 2490 ms | 2550 ms |
| aarch64 (Xperia 10 III) | 3800 ms | 4600 ms |
| arm32 (Xperia 10) | 10700 ms | 21400 ms |
Flatpak
4.1.0
After clicking download, the scroll jumps to the top of the list (most often when the list was scrolled down at the moment of the click); then, when I click download on some model at the top, the scroll may jump down.
Nothing shows in the output when run with the `dsnote --verbose` command, except the following, and these occur only when opening the languages menu.
[W] 20:33:49.225 0x7fe9ee845d80 () - OpenType support missing for "Unifont", script 12
[W] 20:33:49.312 0x7fe9d1066600 () - OpenType support missing for "Unifont", script 12
[W] 20:33:49.371 0x7fe9ee845d80 () - OpenType support missing for "Biwidth", script 11
[W] 20:33:49.380 0x7fe9ee845d80 () - OpenType support missing for "Fixed", script 11
[W] 20:33:49.398 0x7fe9d1066600 () - OpenType support missing for "Biwidth", script 11
[W] 20:33:49.407 0x7fe9d1066600 () - OpenType support missing for "Fixed", script 11
Hi,
It might be very useful to add the OpenDyslexic font for some people who need it.
Also, the ability to import PDF files for transformation into audio files.
Thanks.
A.
I'm on OpenSUSE Tumbleweed and I'm using the Flatpak version of Speech Note.
$ flatpak run net.mkiol.SpeechNote
Qt: Session management error: Could not open network socket
ALSA lib ../../oss/pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib ../../../src/pcm/pcm_direct.c:2045:(snd1_pcm_direct_parse_open_conf) The field ipc_gid must be a valid group (create group audio)
ALSA lib ../../../src/pcm/pcm_direct.c:2045:(snd1_pcm_direct_parse_open_conf) The field ipc_gid must be a valid group (create group audio)
free(): invalid size
I would like to see the Mimic 3 models in this app.
A link to the GitHub is HERE.
It does a better job than Piper in my opinion and sounds more real.
P.S. Awesome project, keep up the good work.
Hello. First of all, thank you for your work. It looks fantastic. At least until now I couldn't try it, since the following problem appears (I took a screenshot so you can see it: "01. Text to speech Spanish - Error"). By the way, I had no problem downloading in English.
Sorry if this request is not well made; it's my first time using GitHub.
Thanks for your work.
Original issue #8
backtrace:
Thread 1 "dsnote" received signal SIGILL, Illegal instruction.
0x00007fffd02795a7 in ?? () from /app/lib/libkenlm.so
cpu flags:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm pti dtherm
On the stable and beta versions, it says that a suitable GPU isn't available. I've installed OpenCL packages on Fedora 38 and the equivalent Flatpak OpenCL packages, but it still says not available.
I get that it might not be useful given it isn't a powerful discrete GPU, but I wondered if a bug might be causing it to report as unavailable.
Looking through whisper.cpp, it needs 3x less memory than the original, which would make it possible to run even the large model on an Xperia 10 III (3.3 GB vs 10 GB). That would probably be overkill, and speed would suffer a lot, but adding small and medium would probably make sense.