
thorsten-voice's Introduction

Thorsten-Voice logo

Motivation for Thorsten-Voice project 🗣️ 💬

A free-to-use, offline-capable, high-quality German TTS voice should be available for every project without any licensing struggles.

Personal words by Thorsten Müller

I contribute my voice as a person who believes in a world where all people are equal, regardless of gender, sexual orientation, religion, skin color and the geocoordinates of their birthplace. A global world where everybody is warmly welcome in any place on this planet, and where open and free knowledge and education are available to everyone. 🌍 (Thorsten Müller)

Please keep in mind that I am no professional voice talent. I'm just a normal guy sharing his voice with the world.

Social media


Feel free to contact me on social media 🤗.

Platform Link
YouTube ThorstenVoice on YouTube
LinkedIn Thorsten Müller on LinkedIn
Twitter ThorstenVoice on Twitter
Hugging Face ThorstenVoice on Hugging Face
Instagram ThorstenVoice on Instagram

Voice-Datasets

All my "Thorsten-Voice" datasets are listed on Zenodo and available for download. Citation is highly appreciated if you use them in your projects, products or papers.

Dataset DOI Link
Thorsten-Voice Dataset 2021.02 (Neutral) DOI
Thorsten-Voice Dataset 2021.06 (Emotional) DOI
Thorsten-Voice Dataset 2022.10 (Neutral) DOI
Thorsten-Voice Dataset 2023.09 (Hessisch) DOI

Thorsten-Voice Dataset 2021.02 (Neutral)

DOI

@dataset{muller_2021_5525342,
  author       = {Müller, Thorsten and
                  Kreutz, Dominik},
  title        = {Thorsten-Voice Dataset 2021.02},
  month        = sep,
  year         = 2021,
  note         = {{Please use it to make the world a better place for 
                   whole humankind.}},
  publisher    = {Zenodo},
  version      = {3.0},
  doi          = {10.5281/zenodo.5525342},
  url          = {https://doi.org/10.5281/zenodo.5525342}
}

Dataset summary

  • Recorded by Thorsten Müller
  • Optimized by Dominik Kreutz
  • LJSpeech file and directory structure
  • 22,668 recorded phrases (WAV files)
  • More than 23 hours of pure audio
  • Sample rate: 22,050 Hz
  • Mono
  • Normalized to -24 dB
  • Phrase length (min/avg/max): 2 / 52 / 180 chars
  • No silence at beginning/ending
  • Avg. spoken chars per second: 14
  • Sentences with question mark: 2,780
  • Sentences with exclamation mark: 1,840
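The LJSpeech layout mentioned above means transcriptions live in a pipe-separated metadata.csv next to a wavs/ directory. As a rough sketch of reading that metadata (assuming the standard LJSpeech conventions; exact file names and column order may differ in a given release):

```python
import csv
from pathlib import Path

def load_ljspeech_metadata(dataset_dir):
    """Parse an LJSpeech-style metadata.csv into (wav_path, text) pairs.

    LJSpeech convention: one line per clip, pipe-separated:
    file_id|transcription|normalized_transcription
    """
    items = []
    metadata = Path(dataset_dir) / "metadata.csv"
    with open(metadata, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            # Prefer the last column (normalized transcription, if present).
            wav = Path(dataset_dir) / "wavs" / f"{row[0]}.wav"
            items.append((wav, row[-1]))
    return items
```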

Dataset evolution

As described in the PDF document (evolution of the Thorsten dataset), this dataset consists of three recording phases.

  • Phase 1: Recorded with a cheap USB microphone (low quality)
  • Phase 2: Recorded with a good microphone (good quality)
  • Phase 3: Recorded with the same good microphone but longer phrases (> 100 chars) (good quality)

If you want to use a subset of the dataset, you can see which files belong to which recording phase in the recording-quality CSV file.
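As a minimal sketch of such a subset selection (the column names "file_id" and "phase" are assumptions here; check the actual header of the recording-quality CSV shipped with the dataset):

```python
import csv

def phase_file_ids(quality_csv, wanted_phases=(2, 3)):
    """Return the file IDs recorded in the given phases, e.g. only the
    good-quality phases 2 and 3."""
    # Column names "file_id" and "phase" are assumed, not verified
    # against the real CSV.
    with open(quality_csv, encoding="utf-8") as f:
        return {row["file_id"]
                for row in csv.DictReader(f)
                if int(row["phase"]) in wanted_phases}
```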

Thorsten-Voice Dataset 2021.06 (Emotional)

DOI

@dataset{muller_2021_5525023,
  author       = {Müller, Thorsten and
                  Kreutz, Dominik},
  title        = {Thorsten-Voice Dataset 2021.06 emotional},
  month        = sep,
  year         = 2021,
  note         = {{Please use it to make the world a better place for 
                   whole humankind.}},
  publisher    = {Zenodo},
  version      = {2.0},
  doi          = {10.5281/zenodo.5525023},
  url          = {https://doi.org/10.5281/zenodo.5525023}
}

All emotional recordings were recorded by myself, and I tried to feel and pronounce each emotion even if the phrase context did not match it. Example: I pronounced the sleepy recordings in the tone I have shortly before falling asleep.

Dataset summary

  • Recorded by Thorsten Müller
  • Optimized by Dominik Kreutz
  • 300 sentences * 8 emotions = 2,400 recordings
  • Mono
  • Sample rate: 22,050 Hz
  • Normalized to -24 dB
  • No silence at beginning/ending
  • Sentence length: 59 - 148 chars

Thorsten-Voice Dataset 2022.10 (Neutral)

DOI

🗣️ Listen to some audio recordings from this dataset here.

@dataset{muller_2022_7265581,
  author       = {Müller, Thorsten and
                  Kreutz, Dominik},
  title        = {Thorsten-Voice Dataset 2022.10},
  month        = nov,
  year         = 2022,
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.7265581},
  url          = {https://doi.org/10.5281/zenodo.7265581}
}

Thorsten-Voice Dataset 2023.09 (Hessisch)

DOI

@dataset{muller_2024_10511260,
  author       = {Müller, Thorsten and
                  Kreutz, Dominik},
  title        = {Thorsten-Voice Dataset 2023.09 Hessisch},
  month        = jan,
  year         = 2024,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.10511260},
  url          = {https://doi.org/10.5281/zenodo.10511260}
}

TTS Models

Based on these open-source voice datasets, several TTS (text-to-speech) models have been trained using AI / machine-learning technology.

There are multiple German models available, trained and used by the projects Coqui AI, Piper TTS and Home Assistant. You can find more information on how to use them, audio samples and video tutorials on the Thorsten-Voice project website.

Listen to audio samples and find installation / usage instructions here (🇩🇪):

In addition, Silero, Monatis and ZDisket have used my voice datasets for model training. More samples and details can be found on the Silero Thorsten-Voice audio samples page. See this Colab notebook for more details.

ZDisket made a tool called TensorVox for setting up a TTS environment on Windows and included a German TTS model trained by monatis. Thanks for sharing that. See it in action on YouTube.

Support & Thanks

If you like my voice contribution and would like to support my efforts toward an open-source voice-technology future, you can support me here:

I want to thank the great people who supported me on this journey with kind words, encouragement and compute power: El-Tocino, Eren Gölge, Gras64, Kris Gesling, Nmstoker, Othiele, Repodiac, SanjaESC, Synesthesiam.

Special thanks to my dear colleague Sebastian Kraus for supporting me with audio recording equipment and for being the creative mastermind behind the logo design, and of course to dear Dominik (@domcross) for being so close by my side on this amazing journey.

"Thorsten-Voice" youtube channel

On my Thorsten-Voice YouTube channel you can find step-by-step (cooking-recipe style) tutorials on open-source voice technology. If you're interested, I'd be happy to welcome you as a new subscriber to my wonderful YouTube community.

Conference speaker

I really like to talk about the importance of an open-source voice-technology future. If you would like me to speak at a conference or event, I'd be happy to be contacted via the Thorsten-Voice website contact form. See some of my speaker references on the Thorsten-Voice website.

thorsten-voice's People

Contributors

snakers4, thorstenmueller


thorsten-voice's Issues

Model Fine-tuning Problem

Hi Thorsten,
I'm trying to train a TTS model with my own voice in German. But if I train a whole new model, the required dataset is too large, so I want to fine-tune your model with my voice.
Regarding this idea, I have some questions:

  • How large should my dataset be?
  • Is it possible to use some sentences from your dataset to record a small dataset?
  • Are there any requirements for the dataset?

I would also like to ask whether you faced out-of-memory problems when training the TTS model.
Thanks for your help in advance.

Made with Thorsten-Voice 😊

As a curious person, I'm wondering what is being done with the Thorsten-Voice project 😃, regardless of whether it's with the original voice datasets or the ready-trained TTS voice models.

I've already learned about some cool projects via email or Twitter ("ThorstenVoice"), but maybe it would be cool to collect them here.
So if you know of public projects that use Thorsten-Voice, or have built one yourself, I would be happy about a short note and a link to the project (if possible). Thank you very much 😊.

Recommendation for Training

Hi Thorsten

hope you are well, and thank you for your previous responses.

I am using ESPnet for training TTS, with Tacotron2 for text2mel + Parallel WaveGAN for decoding.
I am facing some difficulties and would appreciate it if you could give me some comments, if it is not too much trouble.

First, at the moment I am only using character tokenization. So the list of tokens I am using is generally the English alphabet plus a few extra characters; special German characters are being mapped to English characters. (Would this be sufficient for generating good-quality speech, or will this significantly reduce the quality of the generated voice?)
I have difficulty using phonemes at the moment, so I am thinking about leaving the conversion to phonemes for later training!

Then, as you mentioned, there are 3 different quality levels of wav files. I am training on all of the files... is that OK, or do you recommend something else, like training on a subset?

Furthermore, I am training at a 22,050 Hz sample rate (should I stick with this or change to 16 kHz?).
And finally, I occasionally saw samples with some silence at the beginning/end/middle of the wav files, but there are only a few. Do you recommend removing them?
Best regards

Broken voice output?

Hi Thorsten, we love your (digital) voice :D

I did stumble upon a bug though. The sentence "her mit dem sauerbraten, sonst mach ich kleinholz aus deiner garage" comes out like this (check for yourself; it's indescribable...).

There is a shorter test case too... just "hallo".

I'm using the TTS module from pip and voice tts_models/de/thorsten/tacotron2-DCA.

Many greetings :)

Some more goodies for you: Bugudi Hello world

Generally, it seems to help to add a full stop to the end of the sentence to avoid this bug.
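A small helper for that workaround might look like this (a sketch; the punctuation set is a guess at what counts as a sentence terminator for the model):

```python
def ensure_terminal_punctuation(text, default="."):
    """Append a full stop if the sentence has no terminal punctuation,
    which reportedly avoids the garbled-output bug described above."""
    text = text.strip()
    return text if text and text[-1] in ".!?" else text + default
```

So "hallo" becomes "hallo." before being passed to the synthesizer, while sentences already ending in ".", "!" or "?" are left untouched.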

TTS-Models: Download-Links broken?

Hi Thorsten - thanks for your effort!

I followed your YouTube video, but unfortunately I get an error when I try to download your models.
The direct link unfortunately redirects to a Hugging Face login window: https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--de--thorsten--tacotron2-DDC.zip

(tts) D:\Playground\tts>tts --model_name tts_models/de/thorsten/tacotron2-DDC --text "Thorstens TTS ist super" --out_path out.wav
 > Downloading model to C:\Users\Chris\AppData\Local\tts\tts_models--de--thorsten--tacotron2-DDC
  0%|                                                                                                                                               | 0.00/29.0 [00:00<?, ?iB/s] > Error: Bad zip file - https://coqui.gateway.scarf.sh/v0.8.0_models/tts_models--de--thorsten--tacotron2-DDC.zip
Traceback (most recent call last):
  File "D:\Playground\tts\lib\site-packages\TTS\utils\manage.py", line 434, in _download_zip_file
    with zipfile.ZipFile(temp_zip_name) as z:
  File "C:\ProgramData\anaconda3\lib\zipfile.py", line 1267, in __init__
    self._RealGetContents()
  File "C:\ProgramData\anaconda3\lib\zipfile.py", line 1334, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\anaconda3\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\anaconda3\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Playground\tts\Scripts\tts.exe\__main__.py", line 7, in <module>
  File "D:\Playground\tts\lib\site-packages\TTS\bin\synthesize.py", line 345, in main
    model_path, config_path, model_item = manager.download_model(args.model_name)
  File "D:\Playground\tts\lib\site-packages\TTS\utils\manage.py", line 303, in download_model
    self._download_zip_file(model_item["github_rls_url"], output_path, self.progress_bar)
  File "D:\Playground\tts\lib\site-packages\TTS\utils\manage.py", line 439, in _download_zip_file
    raise zipfile.BadZipFile  # pylint: disable=raise-missing-from
zipfile.BadZipFile

Could you maybe provide the Tacotron2 model somewhere else? I would like to load it directly via SpeechBrain anyway, e.g.:

import torchaudio
from speechbrain.pretrained import Tacotron2
from speechbrain.pretrained import HIFIGAN

# Initialize TTS (Tacotron2) and vocoder (HiFi-GAN)
tacotron2 = Tacotron2.from_hparams("Path_to_Thorsten_tts_model")
hifi_gan = HIFIGAN.from_hparams("path_to_hifigan")

Multispeaker-Finetuning on Single-Speaker-VITS-Model

Hi Thorsten,

I'm just discovering the world of TTS and came across your brilliant contribution for the German voice! I've already played around with it a bit and, with a small dataset of my own voice, fine-tuned your VITS model to a very good result. As a next step, I added a second voice and thereby trained a multi-speaker model from your VITS model. That worked reasonably well, but the resulting voices are worse than when fine-tuning the individual voices. So here is my question: is there a way to get the same, or even better, quality (since there is a larger dataset from two speakers) through multi-speaker fine-tuning?
I would be very grateful for your help!

Best regards
Jakob

Sentences truncated

Hi Thorsten,

first of all thank you for all your efforts and countless hours you invested into this project. This is simply impressive!

I just started playing around with it yesterday, and after having installed and updated everything (tts should be v0.5.0), I noticed that the resulting wav files seem to be truncated somewhere in the last word of the sentence.

I tried the following sentences with mixed results:

  • Guten Morgen, das Wetter wird heute heiter bis wolkig bei 13 Grad.
  • Guten Morgen, das Wetter wird heute heiter bis wolkig bei 13 Grad Celsius.
  • Guten Morgen. Wollen wir mal sehen, ob dieser Satz bis zum Ende gesprochen wird.

Attached you can find the sample files.
samples.zip

Any idea what could be causing this?

Thanks
Hanjo

Text file

Hi, really, thanks for your effort in making a dataset. But where can I find the corresponding text file?

Finetuning Tacotron2 on your pretrained model

Hello Thorsten,

first of all, thank you very much for providing such valuable voice data and resources! It's really impressive work that you have done.

Regarding my project: Working with TensorFlowTTS, I would like to fine-tune Tacotron2 with my own data on your pretrained model, which I took from here. So far, I have only worked with pretrained models in .h5 format, which is probably why I'm struggling to get your data.pkl file to "run". Somehow, I can't unpickle the data.pkl files (when loading them, I receive _pickle.UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified.). While doing some research, I came across this notebook, which downloads your models in the desired .h5 format. Unfortunately, the download links are no longer valid (404).

Could you please point me in the right direction on how to use your model as a pretrained model for fine-tuning? Any hint is very much appreciated. Thank you!

Sample audio file?

Hello,
first of all, many thanks for your work; this is a great project and I'm excited to see everything that develops from it. It would be nice, though, if you could provide a sample audio file, so that one can get an impression of the dataset without having to download all the files. That way you quickly know whether the voice is suitable for your own project.

Best regards and thanks,
Stefan

"New" model available

If you trained a model on "Thorsten" dataset please file an issue with some information on it. Sharing a trained model is highly appreciated.

You might find this interesting. I exported the TensorFlowTTS model trained by monatis for use in the TTS program I maintain, https://github.com/ZDisket/TensorVox. Now anyone with a Windows 10 machine (and a CPU with AVX support; anything made after 2010 that isn't a Xeon will do) can use your voice in a few clicks. Here's a sample video: https://youtu.be/tY6_xZnkv-A
I'll think about training another one if I have enough GPU credits left over after my latest project, as this one combined different sources of audio.

Question with Phonemes

Hi Thorsten.
I have a question about tokenization based on phonemes! I know this is not related to this repo, but I would appreciate any help.
I have used your dataset to produce German text-to-speech using Tacotron 2 + MelGAN, and it sounds great thanks to your dataset.
Now I want to try phonemes instead of characters for tokenization, and sadly I don't have any knowledge of that. I have used the espeak package to generate the tokens, and it resulted in the following list of phonemes:

phonemes

Do you have any idea whether this many phonemes looks right? I used the espeak-ng package for this process:
https://github.com/espeak-ng/espeak-ng
And if (as I presume) I am way off track, do you have a resource I can read on this tokenization?

Commercial projects?

Really great work Thorsten!

Is it possible to use it in a commercial project?

And are you available (as a consultant) to support us in a new german-tts project?

Please ping me via e-mail.

// Sergey

Cut off sentences (max_decoder_steps) when using "Thorsten-DCA" model with Coqui TTS v0.1.3

When using my Thorsten-DCA model with @coqui-ai v0.1.3, you may encounter a problem where longer sentences are cut off and the message "max_decoder_steps" is printed (when running TTS on the command line). This is a known issue and will be fixed in the next 🐸TTS release (thx 👑 @erogol).

If you can't wait, you can apply the fix yourself by adding a line to config.json. The location depends on your operating system.

Go to:
Linux: ~/.local/share/tts/
Mac: ~/Library/Application Support/tts/
Win: HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders (Registry)

In the subfolder "tts_models--de--thorsten--tacotron2-DCA", find the line "mixed_precision": false, and add a new line below it with the following content: "max_decoder_steps": 2000,. This should prevent sentences from being cut off.
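Instead of editing config.json by hand, the same fix can be scripted. A minimal sketch (the example path in the docstring is the Linux location mentioned above; adjust it for your system):

```python
import json
from pathlib import Path

def add_max_decoder_steps(config_path, steps=2000):
    """Add or overwrite "max_decoder_steps" in a Coqui TTS model config.

    Example config_path on Linux:
    ~/.local/share/tts/tts_models--de--thorsten--tacotron2-DCA/config.json
    """
    path = Path(config_path).expanduser()
    config = json.loads(path.read_text(encoding="utf-8"))
    config["max_decoder_steps"] = steps  # next to "mixed_precision"
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
```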

Problem of paths in coqui model configs (manual download)

Hi there, thank you very much for your contribution! I have successfully run the TTS by now, but I ran into a few minor problems before I could finally run it.

As instructed in the README, I installed TTS and ran tts-server --model_name tts_models/de/thorsten/tacotron2-DCA. Due to the GFW or something like that, the models would not download automatically, so I got the models tts_models/de/thorsten/tacotron2-DCA and vocoder_models/de/thorsten/fullband-melgan from the link given by Coqui and unzipped them to the appropriate place (at least where Coqui attempted to place them). After rerunning, the program showed the following error:

(venv) PS C:\Users\**mypath**\ThorstenVoice> tts --model_name tts_models/de/thorsten/tacotron2-DCA --text "Was geschehen ist geschehen, es ist geschichte."
 > tts_models/de/thorsten/tacotron2-DCA is already downloaded.
 > vocoder_models/de/thorsten/fullband-melgan is already downloaded.
 > Using model: Tacotron2
Traceback (most recent call last):
  File "c:\users\***\appdata\local\programs\python\python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
    exec(code, run_globals)
  File "C:\Users\**mypath**\ThorstenVoice\venv\Scripts\tts.exe\__main__.py", line 7, 
in <module>
  File "C:\**mypath**\ThorstenVoice\venv\lib\site-packages\TTS\bin\synthesize.py", line 226, in main
    synthesizer = Synthesizer(
  File "C:\Users\**mypath**\ThorstenVoice\venv\lib\site-packages\TTS\utils\synthesizer.py", line 75, in __init__
    self._load_vocoder(vocoder_checkpoint, vocoder_config, use_cuda)
  File "C:\Users\**mypath**\ThorstenVoice\venv\lib\site-packages\TTS\utils\synthesizer.py", line 162, in _load_vocoder
    self.vocoder_ap = AudioProcessor(verbose=False, **self.vocoder_config.audio)
  File "C:\Users\**mypath**\ThorstenVoice\venv\lib\site-packages\TTS\utils\audio.py", line 293, in __init__
    mel_mean, mel_std, linear_mean, linear_std, _ = self.load_stats(stats_path)
  File "C:\Users\**mypath**\ThorstenVoice\venv\lib\site-packages\TTS\utils\audio.py", line 420, in load_stats
    stats = np.load(stats_path, allow_pickle=True).item()  # pylint: disable=unexpected-keyword-arg
  File "C:\Users\**mypath**\ThorstenVoice\venv\lib\site-packages\numpy\lib\npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/home/thorsten/___prj/tts/models/taco2/thorsten-dca/spec-stats.npy'

I studied that a bit and found out that if I change the audio.stats_path property in the config files of the downloaded models (namely tts/tts_models--de--thorsten--tacotron2-DCA/config.json and tts/vocoder_models--de--thorsten--fullband-melgan/config.json) from something like /home/erogol/.local/share/tts/tts_models--de--thorsten--tacotron2-DCA/scale_stats.npy to the place where scale_stats.npy really is on my system, the TTS runs smoothly.
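That manual edit can also be scripted. A sketch, assuming the config has the top-level "audio" section with a "stats_path" key as shown in the traceback above:

```python
import json
from pathlib import Path

def fix_stats_path(config_json, local_stats):
    """Point audio.stats_path at the scale_stats.npy that is actually on
    disk, instead of the absolute path baked in at training time."""
    path = Path(config_json)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["audio"]["stats_path"] = str(Path(local_stats).resolve())
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
```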

I checked both repos' issues, but it seems no one has reported a similar one before. I guess the automatic download process avoids the problem, so I'm just writing this down as a reference for future manual downloaders.

Thank you very much!

Text tokenization

Hi all
I am a non-German speaker trying to train a Tacotron 2 model on your dataset, which, by the way, is a great job and contribution.
Any comments on how to do character-level tokenization on the dataset? Or any tools you might point out?

Thank you

Source of Text Prompts

Hi, great work and great project! 👍

I am interested in how you curated the text prompts for the recordings in your dataset. For example, from where did you source the prompts? Did you collect prompts from multiple domains? Did you select all the prompts you found in a source, a random subset, or did you filter some out in a pre-processing step?

Thanks for any hints! Much appreciated.

Very slow

Is it normal that voice generation takes several minutes for 4 seconds of speech?

Or is this simply due to Tacotron2? Would generation be faster with a model trained with FastSpeech2?

Help when training for new Language

Could you please give me information about training Mozilla TTS with my own dataset? Do I only need the audio and its text?

I prepared a dataset from audiobooks. Like LJSpeech, I have 16-bit PCM mono .wav files. Do I need anything else? Could you tell me the steps I should look at?

Documenting the process of building an open voice model out of audio files

Hey Thorsten,

thank you for this project. Really promising and impressive.

I did not find detailed documentation or a guide for the process you go through to build a TTS model... I mean:

  • Setting up an environment
  • Recording audio
  • Optimizing audio (like Dominik does)
  • Putting (my own) audio recordings and transcriptions in the right place
  • Generating and training a TTS model
  • Using the model to generate speech / audio files from existing text

Maybe you could give a short overview, or at least some links, on how your training process is set up and used in development?

I'm asking because I have a huge set of recordings of my own voice from an ongoing private project, and now I would like to use it to train a model for reading audiobooks to my little daughter. How would I do that?

Maybe this would also be possible for other existing audio sets, if I knew how to prepare them and what to do to make them "compatible" with your process... Take for example https://vorleser.net - all it would take is a little ffmpeg magic and some transcription effort to choose between a wide variety of trained voice models. There is also LibriVox, or DeepSpeech (e.g. https://github.com/AASHISHAG/deepspeech-german). For computing power I would probably use Google Colab, which is free under some circumstances...

What do you think?

NumPy (Torch) issues

Hi, thanks for providing this data and documentation.

I tried following the instructions for Thorsten-22.08-Tacotron2-DDC, but:

$ tts-server --model_name tts_models/de/thorsten/tacotron2-DDC
C:\Python310\lib\site-packages\torchaudio\compliance\kaldi.py:22: UserWarning: Failed to initialize NumPy: module compiled against API version 0x10 but this version of numpy is 0xf (Triggered internally at ..\torch\csrc\utils\tensor_numpy.cpp:77.)
  EPSILON = torch.tensor(torch.finfo(torch.float).eps)
Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Python310\Scripts\tts-server.exe\__main__.py", line 4, in <module>
  File "C:\Python310\lib\site-packages\TTS\server\server.py", line 102, in <module>
    synthesizer = Synthesizer(
  File "C:\Python310\lib\site-packages\TTS\utils\synthesizer.py", line 76, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "C:\Python310\lib\site-packages\TTS\utils\synthesizer.py", line 113, in _load_tts
    self.tts_model = setup_tts_model(config=self.tts_config)
  File "C:\Python310\lib\site-packages\TTS\tts\models\__init__.py", line 13, in setup_model
    model = MyModel.init_from_config(config, samples)
  File "C:\Python310\lib\site-packages\TTS\tts\models\tacotron2.py", line 432, in init_from_config
    tokenizer, new_config = TTSTokenizer.init_from_config(config)
  File "C:\Python310\lib\site-packages\TTS\tts\utils\text\tokenizer.py", line 188, in init_from_config
    phonemizer = get_phonemizer_by_name(config.phonemizer, **phonemizer_kwargs)
  File "C:\Python310\lib\site-packages\TTS\tts\utils\text\phonemizers\__init__.py", line 42, in get_phonemizer_by_name
    return ESpeak(**kwargs)
  File "C:\Python310\lib\site-packages\TTS\tts\utils\text\phonemizers\espeak_wrapper.py", line 91, in __init__
    raise Exception(" [!] No espeak backend found. Install espeak-ng or espeak to your system.")
Exception:  [!] No espeak backend found. Install espeak-ng or espeak to your system.

So I did

$ pip install numpy --upgrade
Requirement already satisfied: numpy in c:\python310\lib\site-packages (1.22.4)
Collecting numpy
  Using cached numpy-1.24.2-cp310-cp310-win_amd64.whl (14.8 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.4
    Uninstalling numpy-1.22.4:
      Successfully uninstalled numpy-1.22.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tts 0.8.0 requires numpy==1.22.4; python_version == "3.10", but you have numpy 1.24.2 which is incompatible.
numba 0.55.2 requires numpy<1.23,>=1.18, but you have numpy 1.24.2 which is incompatible.
Successfully installed numpy-1.24.2

but:

 $ tts-server --model_name tts_models/de/thorsten/tacotron2-DDC
Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Python310\Scripts\tts-server.exe\__main__.py", line 4, in <module>
  File "C:\Python310\lib\site-packages\TTS\server\server.py", line 14, in <module>
    from TTS.utils.synthesizer import Synthesizer
  File "C:\Python310\lib\site-packages\TTS\utils\synthesizer.py", line 14, in <module>
    from TTS.utils.audio import AudioProcessor
  File "C:\Python310\lib\site-packages\TTS\utils\audio\__init__.py", line 1, in <module>
    from TTS.utils.audio.processor import AudioProcessor
  File "C:\Python310\lib\site-packages\TTS\utils\audio\processor.py", line 3, in <module>
    import librosa
  File "C:\Python310\lib\site-packages\librosa\__init__.py", line 211, in <module>
    from . import core
  File "C:\Python310\lib\site-packages\librosa\core\__init__.py", line 5, in <module>
    from .convert import *  # pylint: disable=wildcard-import
  File "C:\Python310\lib\site-packages\librosa\core\convert.py", line 7, in <module>
    from . import notation
  File "C:\Python310\lib\site-packages\librosa\core\notation.py", line 8, in <module>
    from ..util.exceptions import ParameterError
  File "C:\Python310\lib\site-packages\librosa\util\__init__.py", line 83, in <module>
    from .utils import *  # pylint: disable=wildcard-import
  File "C:\Python310\lib\site-packages\librosa\util\utils.py", line 10, in <module>
    import numba
  File "C:\Python310\lib\site-packages\numba\__init__.py", line 42, in <module>
    from numba.np.ufunc import (vectorize, guvectorize, threading_layer,
  File "C:\Python310\lib\site-packages\numba\np\ufunc\__init__.py", line 3, in <module>
    from numba.np.ufunc.decorators import Vectorize, GUVectorize, vectorize, guvectorize
  File "C:\Python310\lib\site-packages\numba\np\ufunc\decorators.py", line 3, in <module>
    from numba.np.ufunc import _internal
SystemError: initialization of _internal failed without raising an exception

Improve Audio Generation Speed

Hi,
first of all thank you very much for your contribution!

I'm trying to build a real-time voice assistant, for which I use different tools for STT, NLP and TTS. I would love to use your voice for this, but on-the-fly audio generation is a bit slow with your Tacotron2 model.

I found this comparison: coqui-ai/TTS#522
Is there any way to speed up audio generation to values similar to those of the English models?

TypeError: can't pickle weakref objects + EOFError: Ran out of input

I have arrived at the last step:

CUDA_VISIBLE_DEVICES="0" python TTS/mozilla_voice_tts/bin/train_vocoder.py --config_path vocoder_config.json
However, I get the following error message:

 > Using CUDA:  False
 > Number of GPUs:  0
vocoder_config.json
 > Git Hash: 49fe63a
....
 > TRAINING (2020-11-29 12:11:09)
 ! Run is removed from E:/Python/tts/TTS_recipes/TTS/tests/data/TestAusgabe/melgan/pwgan-November-29-2020_12+11PM-49fe63a
Traceback (most recent call last):
  File "TTS/mozilla_voice_tts/bin/train_vocoder.py", line 654, in <module>
    main(args)
  File "TTS/mozilla_voice_tts/bin/train_vocoder.py", line 558, in main
    epoch)
  File "TTS/mozilla_voice_tts/bin/train_vocoder.py", line 104, in train
    for num_iter, data in enumerate(data_loader):
  File "E:\Anaconda\envs\umgebung\lib\site-packages\torch\utils\data\dataloader.py", line 352, in __iter__
    return self._get_iterator()
  File "E:\Anaconda\envs\umgebung\lib\site-packages\torch\utils\data\dataloader.py", line 294, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "E:\Anaconda\envs\umgebung\lib\site-packages\torch\utils\data\dataloader.py", line 801, in __init__
    w.start()
  File "E:\Anaconda\envs\umgebung\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "E:\Anaconda\envs\umgebung\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "E:\Anaconda\envs\umgebung\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "E:\Anaconda\envs\umgebung\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "E:\Anaconda\envs\umgebung\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle weakref objects

(umgebung) E:\Python\tts\TTS_recipes> > Using CUDA:  False
 > Number of GPUs:  0
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "E:\Anaconda\envs\umgebung\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "E:\Anaconda\envs\umgebung\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

Any ideas?
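A likely cause, though not a confirmed diagnosis: on Windows, PyTorch starts DataLoader workers with the spawn method, which pickles the dataset object, and anything holding a weakref then fails exactly as in the traceback above. A commonly reported workaround is to disable worker processes in the vocoder config (slower data loading, but nothing gets pickled); the key names below follow the Mozilla TTS example configs of that era, so check your own vocoder_config.json for the exact fields:

```json
{
    "num_loader_workers": 0,
    "num_val_loader_workers": 0
}
```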

Voz Português Brazil

Hi, how are you?

Could you use this voice to train a Portuguese voice on coqui.ai? We in Brazil don't have one; there is only one for Portugal, and the pronunciation is very different. There is this model: TTS-Portuguese Corpus https://github.com/Edresson/TTS-Portuguese-Corpus
Its author has already trained a Portuguese model, but I don't know how to load it into coqui.ai to be able to use it.

Question: Use voice for voice adjustment.

Hey @thorstenMueller,

thanks for sharing this dataset with the world.
I checked out the Colab notebook and I am impressed by the results.
As I want a TTS for my own voice, I wanted to know whether I could
somehow adjust your voice, retrain the model and get similar results?

Best regards
Chris

Do you have high quality audio?

Since all the downloads are 22.05 kHz, I was wondering if you had the original high-sampling-rate recordings (usually 44.1, 48 or 96 kHz).
In my experience it is trivial to train models that output high-fidelity audio.

ValueError: Phonemizer is not defined in the TTS config.

Hi,
I was trying to use tts_models/de/thorsten/tacotron2-DCA but I get the error

> tts_models/de/thorsten/tacotron2-DCA is already downloaded.
> vocoder_models/de/thorsten/fullband-melgan is already downloaded.
Traceback (most recent call last):
  File "/home/user/python/venv/speech/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/user/python/venv/speech/lib/python3.10/site-packages/TTS/bin/synthesize.py", line 309, in main
    synthesizer = Synthesizer(
  File "/home/user/python/venv/speech/lib/python3.10/site-packages/TTS/utils/synthesizer.py", line 76, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/home/user/python/venv/speech/lib/python3.10/site-packages/TTS/utils/synthesizer.py", line 111, in _load_tts
    raise ValueError("Phonemizer is not defined in the TTS config.")
ValueError: Phonemizer is not defined in the TTS config.

The config.json says

"use_phonemes": true,
"phonemizer": null,
"phoneme_language": "de",

which sounds wrong.
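A workaround often reported for this kind of version mismatch (an assumption to verify against your installed TTS release, not an official fix) is to edit the downloaded config.json and set the phonemizer explicitly, e.g. to espeak, which requires espeak-ng to be installed on the system:

```json
{
    "use_phonemes": true,
    "phonemizer": "espeak",
    "phoneme_language": "de"
}
```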

Emphasis on syllables – How to choose?

Hi there,

during the last few days I've been trying out the Thorsten voice in a Python virtual environment setup, as described in the German-language video Freie Thorsten Stimme in LINUX lokal nutzen Text-to-Speech TTS Tutorial.

I'm amazed by the very natural-sounding voice quality. Only in some words did I find the emphasis put on syllables that don't usually receive it in spoken German.

In some test phrase there was, for example, the originally English derived word "Marketing", which now got stressed on the second syllable.

Now I wondered, whether there might be any way to instruct the tts program or tts-server to put the emphasis on the first syllable.

On my web search I came across a question where the original poster said:

I know that some voice engines use special characters like + or ' in front of a stressed vowel.

I tried this suggestion several times (mainly referring to syllables, though), with different methods:

  1. directly, by executing the following commands:
    tts --text "Marketing." --model_name tts_models/de/thorsten/vits --out_path marketing1.wav
    tts --text "+Marketing." --model_name tts_models/de/thorsten/vits --out_path marketing2.wav
    tts --text "'Marketing." --model_name tts_models/de/thorsten/vits --out_path marketing3.wav

  2. Starting a server
    tts-server --model_name tts_models/de/thorsten/vits

and subsequently using:

a) the browser at localhost:5002/, inserting the strings
"+Marketing." (saved as marketing4.wav) and
"'Marketing." (saved as marketing5.wav).

b) curl:
curl -o marketing6.wav http://localhost:5002/api/tts?text=+Marketing.
curl -o marketing7.wav http://localhost:5002/api/tts?text=\'Marketing.

c) cTTS (Python3):
import cTTS
cTTS.synthesizeToFile("marketing8.wav", "+Marketing.")
cTTS.synthesizeToFile("marketing9.wav", "'Marketing.")
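One caveat about variant b): in a URL query string an unencoded + is decoded as a space on the server side, so the curl request for marketing6.wav most likely never delivered the + to the TTS server at all; it would need to be sent as %2B. A quick check with the standard library:

```python
from urllib.parse import quote, unquote_plus

# A literal "+" must be percent-encoded in a query string ...
print(quote("+Marketing."))         # %2BMarketing.
# ... because an unencoded "+" decodes back to a space on the server
print(unquote_plus("+Marketing."))  # " Marketing." - the plus sign is gone
```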

You can find the resulting sound files attached, packed in a zip file.

To my ears, there is not really much difference between them, though. The emphasis seems to rest mainly on the second syllable.

Now I'm wondering what else I might try. In case you have any ideas or suggestions, I would greatly appreciate hearing them.

Maybe I should mention that I am only taking my first steps in programming. As for my system, I am working on an up-to-date Linux system (a derivative of Debian 11, without systemd). It's an older machine, though; that's probably why, at the moment, I can only use the VITS model.

Thanks in advance

marketing_wav.zip

Creating my own TTS

Hi Thorsten!

Unfortunately, I can't find any other way to contact you. I might like to do something like this myself, but e.g. in Bavarian or Franconian. How does one go about that?

Best regards
Johannes

44 kHz 16-bit available?

First of all, thank you so much for making this data available!
Would it be possible to share the 44 kHz 16-bit data?

Retraining your models with my own voice

Hi Thorsten, first of all, thank you for the incredible effort you put into this!

I would like to tackle retraining, specifically retraining your Tacotron2 model with my own voice and seeing how the results turn out. I'm coming from Corentin Jemine's Real-Time Voice Cloning, which tries to use the embeddings of one's own voice in the synthesizer, but the results are rather poor: the embeddings have too little influence. So the next attempt is retraining, hence this try with coqui.ai.

My concrete question/request is whether you could provide the Tacotron2Config you used for training; I haven't found it anywhere in your files.

And a question out of casual interest: do you train on an Nvidia AGX? :)

Recording a free emotional German dataset

Because there are some interesting papers based on emotional speech datasets, I've decided to go back to the microphone and record a free-to-use emotional German dataset.

I've prepared a German corpus:

  • 304 phrases
  • length between 25 and 115 chars

I'll record every phrase in the following emotions:

  • Amused (Erfreut)
  • Angry (Wütend)
  • Disgusted (Angeekelt)
  • Sleepy (Schläfrig)
  • Surprised (Überrascht)
  • Normal

I'm no professional voice actor, so the quality might not be as good as some might expect.

I will keep you updated on the recording progress.

Help for vocoder training for Coqui

Hi,
I am wondering whether I have to train a vocoder for a new language (a newly trained TTS) in Mozilla TTS. Where can I find documentation on vocoder training?

Thanks,
Neda

Windows: tts_to_file ignoring German Umlauts

Hi, when I call the tts_to_file function of the Thorsten tacotron2-DDC model on Windows, Coqui-TTS seems to ignore the German umlauts:

from TTS.api import TTS

tts_model = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", gpu=True)
tts_model.tts_to_file("Öffentliche trübe Tragödie", file_path="voice_bot.wav")

The resulting wave file skips the included umlauts.
According to the console output, the string is displayed correctly (no encoding issues):

 > tts_models/de/thorsten/tacotron2-DDC is already downloaded.
 > vocoder_models/de/thorsten/hifigan_v1 is already downloaded.
...
 > Text splitted to sentences.
['Öffentliche trübe Tragödie']

The issue only occurs on Windows. However, when using a different German model, e.g. tts_model = TTS("tts_models/deu/fairseq/vits", gpu=True), it works fine. Furthermore, the Fairseq model explicitly outputs its vocabulary beforehand, and the German umlauts are included there:

 > tts_models/deu/fairseq/vits is already downloaded.
 > Setting up Audio Processor...
....
[' ', 'v', '2', 'q', 'g', '-', 'f', '1', '8', 'a', 'h', '4', 'ö', '3', 'r', 'm', 'ä', 'l', 'n', 't', 'ë', 'd', 'b', 'y', 'ß', 'o', 'u', '_', 'j', 's', '6', '5', 'ï', 'c', 'i', 'ü', 'p', 'k', 'e', '–', 'w', 'z', '7', 'x', '0']
 > Text splitted to sentences.
['Öffentliche trübe Tragödie']
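One thing worth ruling out (a guess, not a confirmed cause): text being decoded with the Windows default encoding instead of UTF-8 somewhere along the way. A small diagnostic sketch that round-trips the same phrase through an explicit UTF-8 read/write before handing it to the TTS call:

```python
from pathlib import Path

# Round-trip the test phrase with an explicit encoding so the platform
# default (e.g. cp1252 on Windows) cannot silently mangle it.
path = Path("umlaut_test.txt")
path.write_text("Öffentliche trübe Tragödie", encoding="utf-8")
text = path.read_text(encoding="utf-8")

# If the umlauts survive here but still vanish in the audio, the problem
# lies in the model or its characters config, not in your input handling.
assert all(ch in text for ch in "Öüö")
print(text)
```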

Rhasspy German voice

Hi Thorsten, thank you for your contribution!

I'm using your dataset to train a model for Rhasspy, an open-source offline voice assistant (community site). I'm using a fork of MozillaTTS called Larynx to train a GlowTTS model and a multi-band MelGAN vocoder.

It's not done yet, but here are some samples (without vocoder): https://drive.google.com/drive/folders/1IImZKg5CES02CxKK4vk8iy9gkIyvHmMk?usp=sharing

My TTS models use a restricted set of phonemes to keep their size down, which unfortunately makes them incompatible with MozillaTTS. I created a tool called gruut that does phonemization differently from phonemizer (using a lexicon and a pre-trained grapheme-to-phoneme model).

To get an idea of what a "finished" voice is like, see the Dutch voice I trained from rdh's dataset (also a user on the MozillaTTS Discourse site). I also released that voice as an add-on for Home Assistant ☺️

I'll post here again when the model and Docker images are ready. Thanks again!

The word "Prolog" leads to Decoder stopped with `max_decoder_steps`

Hello Mr. Müller,

the word "Prolog" produces the error Decoder stopped with max_decoder_steps. Even increasing the limit to 4000 still leads to the error. (TTS version 0.17.4)
The audio file was generated with
tts --text "Prolog" --model_name tts_models/de/thorsten/tacotron2-DDC --out_path test1.wav

The audio file is relatively long, and after the word "Prolog" fragments can be heard.

Audio file:
https://github.com/thorstenMueller/Thorsten-Voice/assets/22102973/bf4caaaf-eee9-4d3d-a5f0-e70a7bff9874
The text output of the command
tts --text "Prolog" --model_name tts_models/de/thorsten/tacotron2-DDC --out_path test1.wav

tts_models/de/thorsten/tacotron2-DDC is already downloaded.
vocoder_models/de/thorsten/hifigan_v1 is already downloaded.
Using model: tacotron2
Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > log_func:np.log
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:False
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:None
| > pitch_fmin:0.0
| > pitch_fmax:640.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:None
| > base:2.718281828459045
| > hop_length:256
| > win_length:1024
Model's reduction rate r is set to: 2
Vocoder Model: hifigan
Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:False
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:None
| > pitch_fmin:1.0
| > pitch_fmax:640.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:/home/user/.local/share/tts/vocoder_models--de--thorsten--hifigan_v1/scale_stats.npy
| > base:10
| > hop_length:256
| > win_length:1024
Generator Model: hifigan_generator
Discriminator Model: hifigan_discriminator
Removing weight norm...
Text: Prolog
Text splitted to sentences.
['Prolog']
Decoder stopped with max_decoder_steps 4000
Processing time: 200.3763484954834
Real-time factor: 2.144222194124611
Saving output to test1.wav
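The numbers in this log are internally consistent with a decoder that ran all the way to its cap. Each decoder step emits r spectrogram frames (r = 2 above) and each frame advances hop_length samples (256 at 22050 Hz), so 4000 steps correspond to roughly 93 seconds of audio; dividing the reported processing time by the reported real-time factor gives about the same duration. A quick check using only values copied from the output above:

```python
# Values copied from the log output above
max_decoder_steps = 4000
reduction_rate_r = 2        # "Model's reduction rate r is set to: 2"
hop_length = 256
sample_rate = 22050

# Audio length if the decoder runs to its cap:
audio_seconds_at_cap = max_decoder_steps * reduction_rate_r * hop_length / sample_rate
print(round(audio_seconds_at_cap, 1))  # ~92.9 seconds

# Cross-check: processing time / real-time factor = generated audio duration
implied_seconds = 200.3763484954834 / 2.144222194124611
print(round(implied_seconds, 1))       # ~93.4 seconds, so the decoder hit its limit
```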

Voice synthesizing fails after finetuning

Hi Thorsten,

I've got a question concerning finetuning of a voice. I have finetuned a couple of voices in the past based on your German VITS model, and it has always worked perfectly. However, now I'm trying to finetune my own voice. I haven't changed any of the settings, just the dataset, which has roughly the same size and content as the previous datasets I used for finetuning. After around 5000 training steps the sound of the voice is great, but for some reason Coqui does not synthesize my input text correctly: the resulting audio is gibberish that sounds like my voice but is nowhere near the text I wanted. Do you by any chance know what I could improve or change, or did you have similar problems in the past? The other voices I've trained work perfectly, so I'm totally stuck and could really use some help. In case you need any files to better understand the problem, please let me know.

Thanks a lot in advance!
Jakob

training duration / female voice?

Thanks so much for your voice and your YouTube explanation videos, you really got me into this stuff.

now I have some specific questions:

  • How many hours of training are recommended for training on e.g. 4000 wav samples? I'm currently capped at 8 hours, and the resulting voice models are either too low- or too high-pitched, but otherwise sound OK (when adjusted in Audition). Will the pitch improve with longer training?
  • Would it be possible to finetune a female dataset on your (male) model and achieve convincing results (i.e. a female voice)?
  • If so, what are the recommended settings? (I've only found a reference to Melotron as relevant so far.)

Thanks in advance and keep rocking

BTW, I recommend this Whisperer repository for creating the dataset in one go: https://github.com/miguelvalente/whisperer
