jschmie / scraibe
Tool for automatic transcription and speaker diarization based on whisper and pyannote.
Home Page: https://jschmie.github.io/ScrAIbe/
License: GNU General Public License v3.0
Both the input value and the module are named json, which does not work: the parameter shadows the module.
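The clash can be reproduced in a few lines; broken and fixed are hypothetical names, not ScrAIbe functions, but they show why the shadowing breaks the call:

```python
import json

def broken(json):
    # The parameter shadows the json module: inside this function,
    # json is whatever the caller passed (e.g. a dict), so the
    # module's dumps() is unreachable and this raises AttributeError.
    return json.dumps(json)

def fixed(payload):
    # Renaming the parameter keeps the module reachable.
    return json.dumps(payload)
```

Renaming either the parameter or the import (import json as json_module) resolves the clash.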
Hello again, and by the way, thanks for the cool project 😊
Using the small whisper model and "Auto Transcribe" needs almost 11 GB of VRAM. After transcription and diarization are done, the model seems to be kept in VRAM at 11 GB. As a "GPU-poor" person I ask: would it be possible to flush it automatically after use? Is there maybe also a way to set the beam size?
Edit: this is weird: small uses about 11 GB of VRAM, medium about 9 GB, and large 11 GB. Does it batch, or maybe share RAM?
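A minimal sketch of what "flushing after use" could look like, assuming the model is a plain PyTorch nn.Module on the GPU; the helper name and flow are illustrative, not ScrAIbe's API:

```python
import gc
import torch

def release_model(model: torch.nn.Module) -> None:
    # Illustrative sketch: after a transcription run, move the weights
    # off the GPU, drop the reference, and ask PyTorch to hand cached
    # memory back to the driver.
    model.to("cpu")   # copies parameters back to host memory
    del model         # drops this function's reference; callers must drop theirs too
    gc.collect()      # collect lingering Python-side references
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release CUDA caching-allocator blocks
```

Note that empty_cache() only returns memory the caching allocator no longer uses, so every live reference to the model has to be dropped first.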
Btw: there are a few other projects with far less user-friendly web UIs, or just an API, that use faster-whisper or insanely-fast-whisper; they need less VRAM and are also faster, mostly through CTranslate2, batching, BetterTransformer, FlashAttention-2, or distil-whisper models (also available for German).
I've already used whisper-asr-webservice, which currently doesn't have diarization but tries to implement it via whisperX, and wordcab-transcribe, which uses NVIDIA NeMo for diarization. Maybe some of these resources are of use to you?
I've no serious programming knowledge; I just dabble a little. I really like your concept for the web UI. I actually tried something similar with a simple Gradio interface half a year ago, which transcribes, diarizes via the wordcab-transcribe API, and also formats the .json and associates names. But it never worked as robustly as I hoped, and I stopped working on it for lack of time and programming knowledge. Forgive me for this wall of text; I'm just a little excited about the possibilities and really glad I found your project 😄
Via the wordcab-transcribe API it used about 4 GB of VRAM, with a spike of 10 GB for the first 20 seconds (probably due to diarization), using the large-v2 model, and took about 2:30 min for a 22-minute file.
So maybe there's some room for improvement?
I also tried insanely-fast-whisper, which simply combines several optimizations, and it took about 33 seconds (same file, transcription task only; segmenting added about 1:20 min) with less than 8 GB of VRAM. That is 150 minutes of audio in under 5 minutes for transcription + diarization (I have to recheck the exact time/usage).
Traceback (most recent call last):
File "/Users/brianjking/opt/anaconda3/envs/scraibe/bin/scraibe", line 5, in <module>
from scraibe.cli import cli
File "/Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/scraibe/__init__.py", line 1, in <module>
from .autotranscript import *
File "/Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/scraibe/autotranscript.py", line 40, in <module>
from .diarisation import Diariser
File "/Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/scraibe/diarisation.py", line 34, in <module>
from pyannote.audio import Pipeline
File "/Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/pyannote/audio/__init__.py", line 29, in <module>
from .core.inference import Inference
File "/Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/pyannote/audio/core/inference.py", line 34, in <module>
from pyannote.audio.core.io import AudioFile
File "/Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/pyannote/audio/core/io.py", line 38, in <module>
import torchaudio
File "/Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/torchaudio/__init__.py", line 1, in <module>
from torchaudio import _extension # noqa: F401
File "/Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/torchaudio/_extension.py", line 67, in <module>
_init_extension()
File "/Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/torchaudio/_extension.py", line 61, in _init_extension
_load_lib("libtorchaudio")
File "/Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/torchaudio/_extension.py", line 51, in _load_lib
torch.ops.load_library(path)
File "/Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/torch/_ops.py", line 220, in load_library
ctypes.CDLL(path)
File "/Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/ctypes/__init__.py", line 374, in __init__
self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/torchaudio/lib/libtorchaudio.so, 0x0006): Symbol not found: __ZN2at8internal15invoke_parallelExxxRKNSt3__18functionIFvxxEEE
Referenced from: <FDA92314-6B3C-3951-A6EA-674B8F2438DA> /Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/torchaudio/lib/libtorchaudio.so
Expected in: <BAC87571-ABAB-3E0E-AC71-304C308C3507> /Users/brianjking/opt/anaconda3/envs/scraibe/lib/python3.10/site-packages/torch/lib/libtorch_cpu.dylib
@JSchmie Any ideas? Thanks!
Similar to openai/whisper#928, I noticed whisper includes common phrases hinting at training-dataset bias.
The resulting segments are <1 s and thus can be filtered relatively easily.
Furthermore, this seems to be a problem only when using the whisper model large-v2. (In the newest model, large-v3, this seems to be fixed; available in openai-whisper >= v20231106.)
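The <1 s filter could be sketched as follows, assuming whisper's usual result dict whose "segments" entries carry "start" and "end" timestamps; the helper name is mine, not part of ScrAIbe:

```python
def drop_short_segments(result: dict, min_duration: float = 1.0) -> dict:
    # Keep only segments at least min_duration seconds long; very short
    # segments are the likely hallucinated stock phrases.
    kept = [s for s in result["segments"]
            if s["end"] - s["start"] >= min_duration]
    return {**result, "segments": kept}
```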
Just FYI: to build the docker image, it seems that an additional "models" folder needs to be created at the root when cloning the repo.
[+] Building 0.9s (10/21) docker:default
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 1.74kB 0.0s
=> [internal] load metadata for docker.io/pytorch/pytorch:1.11.0-cuda11.3-cudnn8-r 0.8s
=> [auth] pytorch/pytorch:pull token for registry-1.docker.io 0.0s
=> [ 1/16] FROM docker.io/pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime@sha256:99 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 133.34kB 0.0s
=> CACHED [ 2/16] WORKDIR /app 0.0s
=> CACHED [ 3/16] COPY requirements.txt /app/requirements.txt 0.0s
=> CACHED [ 4/16] COPY README.md /app/README.md 0.0s
=> ERROR [ 5/16] COPY models /app/models 0.0s
------
> [ 5/16] COPY models /app/models:
------
Dockerfile:26
--------------------
24 | COPY requirements.txt /app/requirements.txt
25 | COPY README.md /app/README.md
26 | >>> COPY models /app/models
27 | COPY scraibe /app/scraibe
28 | COPY setup.py /app/setup.py
--------------------
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref 4A4L:KJIP:A7CG:EM6K:KMYL:AP5S:ZADH:EBS7:RIFH:FWE7:IZH2:OET5::zctq8tubo5s6j4po9qojalgm7: "/models": not found
It also seems like there is a slight error in the Dockerfile. I tried:
COPY requirements.txt /app/requirements.txt
COPY scraibe /app/scraibe
COPY setup.py /app/setup.py
instead of
COPY requirements.txt /app/requirements.txt
COPY scraibe /app/Scraibe
COPY setup.py /app/setup.py
otherwise this error happens:
=> ERROR [10/12] RUN pip install /app/ 0.7s
------
> [10/12] RUN pip install /app/:
0.550 Processing /app
0.550 Preparing metadata (setup.py): started
0.635 Preparing metadata (setup.py): finished with status 'error'
0.637 error: subprocess-exited-with-error
0.637
0.637 × python setup.py egg_info did not run successfully.
0.637 │ exit code: 1
0.637 ╰─> [6 lines of output]
0.637 Traceback (most recent call last):
0.637 File "<string>", line 2, in <module>
0.637 File "<pip-setuptools-caller>", line 34, in <module>
0.637 File "/app/setup.py", line 16, in <module>
0.637 with open(verfile, "r") as fp:
0.637 FileNotFoundError: [Errno 2] No such file or directory: '/app/scraibe/version.py'
0.637 [end of output]
0.637
0.637 note: This error originates from a subprocess, and is likely not a problem with pip.
0.638 error: metadata-generation-failed
0.638
0.638 × Encountered error while generating package metadata.
0.638 ╰─> See above for output.
0.638
0.638 note: This is an issue with the package mentioned above, not pip.
0.638 hint: See above for details.
------
Dockerfile:20
--------------------
18 | RUN conda install -c conda-forge libsndfile
19 | RUN pip install torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
20 | >>> RUN pip install /app/
21 | RUN pip install markupsafe==2.0.1 --force-reinstall
22 | RUN Scraibe --hf_token $hf_token
--------------------
ERROR: failed to solve: process "/bin/sh -c pip install /app/" did not complete successfully: exit code: 1
There may be a similar problem in the Dockerfile with
RUN Scraibe --hf_token $hf_token
=> ERROR [12/12] RUN Scraibe --hf_token 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 0.5s
------
> [12/12] RUN Scraibe --hf_token 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 :
0.448 /bin/sh: 1: Scraibe: not found
------
Dockerfile:22
--------------------
20 | RUN pip install /app/
21 | RUN pip install markupsafe==2.0.1 --force-reinstall
22 | >>> RUN Scraibe --hf_token $hf_token
23 | # Expose port
24 | EXPOSE 7860
--------------------
ERROR: failed to solve: process "/bin/sh -c Scraibe --hf_token $hf_token" did not complete successfully: exit code: 127
After changing it to
RUN scraibe --hf_token $hf_token
it runs further, up to:
=> ERROR [12/12] RUN scraibe --hf_token hf_jbOynJACnotrealtokenfwBYEqCRV 2.6s
------
> [12/12] RUN scraibe --hf_token hf_jbOynJACnotrealtokenwfBYEqCRV:
2.123 Traceback (most recent call last):
2.123 File "/opt/conda/bin/scraibe", line 5, in <module>
2.123 from scraibe.cli import cli
2.123 File "/opt/conda/lib/python3.8/site-packages/scraibe/__init__.py", line 10, in <module>
2.123 from .app.gradio_app import *
2.123 File "/opt/conda/lib/python3.8/site-packages/scraibe/app/__init__.py", line 2, in <module>
2.123 from .gradio_app import *
2.123 File "/opt/conda/lib/python3.8/site-packages/scraibe/app/gradio_app.py", line 37, in <module>
2.123 from tkinter import CURRENT
2.123 File "/opt/conda/lib/python3.8/tkinter/__init__.py", line 36, in <module>
2.123 import _tkinter # If this fails your Python may not be configured for Tk
2.123 ImportError: libX11.so.6: cannot open shared object file: No such file or directory
------
Dockerfile:22
--------------------
20 | RUN pip install /app/
21 | RUN pip install markupsafe==2.0.1 --force-reinstall
22 | >>> RUN scraibe --hf_token $hf_token
23 | # Expose port
24 | EXPOSE 7860
--------------------
ERROR: failed to solve: process "/bin/sh -c scraibe --hf_token $hf_token" did not complete successfully: exit code: 1
I haven't tried any further so far.
Traceback (most recent call last):
File "/home/usrPycharmProjects/autotranscript/transcribe.py", line 24, in
text = model.transcribe("test.MXF")
File "/home/usrPycharmProjects/autotranscript/autotranscript/autotranscript.py", line 78, in transcribe
diarisation = self.diariser.diarization(dia_audio,
File "/home/usr/PycharmProjects/autotranscript/autotranscript/diarisation.py", line 41, in diarization
diarization = self.model(audiofile,*args, **kwargs)
File "/home/usr/anaconda3/envs/whisper_new/lib/python3.9/site-packages/pyannote/audio/core/pipeline.py", line 238, in call
return self.apply(file, **kwargs)
File "/home/usr/anaconda3/envs/whisper_new/lib/python3.9/site-packages/pyannote/audio/pipelines/speaker_diarization.py", line 512, in apply
discrete_diarization = self.reconstruct(
File "/home/usr/anaconda3/envs/whisper_new/lib/python3.9/site-packages/pyannote/audio/pipelines/speaker_diarization.py", line 397, in reconstruct
clustered_segmentations = np.NAN * np.zeros(
ValueError: negative dimensions are not allowed
This is raised when a file does not contain any speech, or more precisely when it contains no speech audio but only noise. The fix is easy: just handle the error and finish with no transcription, but we need a more user-friendly error message.
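A minimal sketch of handling that error, assuming the diariser is callable on an audio file path; the wrapper name and message are illustrative, not the project's API:

```python
def safe_diarization(diariser, audiofile):
    # Wrap the pyannote call so silent or noise-only files yield a
    # clear message instead of a bare numpy ValueError.
    try:
        return diariser(audiofile)
    except ValueError as err:
        if "negative dimensions" in str(err):
            raise RuntimeError(
                f"No speech detected in {audiofile!r}; nothing to transcribe."
            ) from err
        raise  # unrelated ValueErrors still propagate unchanged
```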
I just found a little error in the README's example cmdline usage, where it shows:
scraibe -f "audio.wav" --language "german" --num_speakers 2
But the only language options are:
{af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
So it should be "German" or "de" in the example.
Also, --num_speakers is not an available option for scraibe; scraibe --help only lists the following:
-h, --help show this help message and exit
-f AUDIO_FILES [AUDIO_FILES ...], --audio-files AUDIO_FILES [AUDIO_FILES ...]
--whisper-type {whisper,whisperx}
--whisper-model-name WHISPER_MODEL_NAME
--whisper-model-directory WHISPER_MODEL_DIRECTORY
--diarization-directory DIARIZATION_DIRECTORY
--hf-token HF_TOKEN HuggingFace token for private model download. (default: None)
--inference-device INFERENCE_DEVICE
--num-threads NUM_THREADS
--output-directory OUTPUT_DIRECTORY, -o OUTPUT_DIRECTORY
--output-format {txt,json,md,html}, -of {txt,json,md,html}
--verbose-output VERBOSE_OUTPUT
--task {autotranscribe,diarization,autotranscribe+translate,translate,transcribe}
--language
But that option would actually be nice, so adding it in cli.py is probably a good idea.
Thank you for a great package! I am wondering if you plan to support speaker recognition? Given a folder with voice samples for speakers, it assigns each speaker a name rather than a placeholder.
Thanks
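For what it's worth, the matching step of such a feature can be sketched with plain cosine similarity over speaker embeddings. The embeddings themselves would come from a model such as pyannote's; here they are plain numpy arrays, and the function name and threshold are illustrative assumptions:

```python
import numpy as np

def name_speakers(speaker_embeddings: dict, reference_embeddings: dict,
                  threshold: float = 0.7) -> dict:
    # Assign each diarized placeholder (e.g. "SPEAKER_00") the enrolled
    # name whose reference embedding is most similar; keep the
    # placeholder if nothing exceeds the threshold.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    names = {}
    for placeholder, emb in speaker_embeddings.items():
        best_name, best_sim = placeholder, threshold
        for name, ref in reference_embeddings.items():
            sim = cosine(emb, ref)
            if sim > best_sim:
                best_name, best_sim = name, sim
        names[placeholder] = best_name
    return names
```

In practice one embedding per enrolled voice sample and per diarized speaker would be extracted with the same embedding model so the vectors are comparable.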