parler-tts's Issues

Sharing results of Korean (+English) bilingual training

Hello,

Based on your code, I added Korean tokens (using a Korean emotional dataset) to the tokenizer and fine-tuned the model with the LibriTTS-R dataset. The Korean dataset is slightly less than 300 hours, similar in size to LibriTTS-R but with fewer speakers.
I did not perform separate emotion classification.

I wanted to share the wandb report, but due to security reasons, I am unable to do so.
Instead, I am providing the training and evaluation curves as images and some audio files as a compressed file.

Notably, no preprocessing was applied to the sentences in the Korean dataset.
The performance turned out to be better than expected, so I believe that if Korean is included in the pretraining stage, it could yield excellent results in the fine-tuning stage. I hope that Korean will be included in the next version of the model.

Thank you.
train
eval
samples.zip

How to work with a dataset, or a question about an example of working with a dataset

Hello. I wanted to ask.

The recipe says a lot about the requirements for the dataset and, as I understand it, a fairly advanced technology stack is used to assemble the dataset and train the model.

But I think it will generally be difficult for novice users (like me) to understand how to compose a dataset and how to feed it to the scripts in order to make their own model, or a model based on yours.

There is no clear instruction or tool that would help people deal with their wav or mp3 files automatically, without unnecessary intervention.

Not everyone can use this technology stack, and I wish there were an easier step-by-step recipe, or examples of the steps on Google Colab showing how you do it.

It's difficult for me to immediately understand what needs to be done, because I personally, like many who have looked here, have not used Parquet tables, DataSpeech, or much else that members of the Hugging Face community use.

problem with DataCollatorParlerTTSWithPadding

Hi, I got this issue:
Traceback (most recent call last):
File "/home/neta_glazer_aiola_com/PycharmProjects/TTS/parler-tts/./training/run_parler_tts_training.py", line 1827, in
main()
File "/home/neta_glazer_aiola_com/PycharmProjects/TTS/parler-tts/./training/run_parler_tts_training.py", line 1648, in main
for batch in train_dataloader:
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/accelerate/data_loader.py", line 454, in iter
current_batch = next(dataloader_iter)
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in next
data = self._next_data()
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
return self._process_data(data)
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
data.reraise()
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 759, in convert_to_tensors
tensor = as_tensor(value)
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 721, in as_tensor
return torch.tensor(value)
TypeError: not a sequence

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/home/neta_glazer_aiola_com/PycharmProjects/TTS/parler-tts/./training/run_parler_tts_training.py", line 559, in call
input_ids = self.description_tokenizer.pad(
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3355, in pad
return BatchEncoding(batch_outputs, tensor_type=return_tensors)
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 224, in init
self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 775, in convert_to_tensors
raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (input_ids in this case) have excessive nesting (inputs type list where type int is expected).

I saw that
input_ids = [{"input_ids": feature["input_ids"]} for feature in features]

is a list of lists, so I changed it to:
input_ids = [{"input_ids": feature["input_ids"][0]} for feature in features]
and it worked.

But then I got this:

Traceback (most recent call last):
File "/home/neta_glazer_aiola_com/PycharmProjects/TTS/parler-tts/./training/run_parler_tts_training.py", line 1832, in
main()
File "/home/neta_glazer_aiola_com/PycharmProjects/TTS/parler-tts/./training/run_parler_tts_training.py", line 1655, in main
loss, train_metric = train_step(batch, accelerator, autocast_kwargs)
File "/home/neta_glazer_aiola_com/PycharmProjects/TTS/parler-tts/./training/run_parler_tts_training.py", line 1584, in train_step
outputs = model(**batch)
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/accelerate/utils/operations.py", line 822, in forward
return model_forward(*args, **kwargs)
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/accelerate/utils/operations.py", line 810, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/home/neta_glazer_aiola_com/PycharmProjects/TTS/parler-tts/parler_tts/modeling_parler_tts.py", line 1995, in forward
encoder_outputs = self.text_encoder(
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1974, in forward
encoder_outputs = self.encoder(
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1015, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 160, in forward
return F.embedding(
File "/opt/conda/envs/parlenv/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

I'd be happy to get some help. Thanks!
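
For anyone hitting the same thing, here is a small, hypothetical illustration of where the nesting can come from (it is not the exact preprocessing in the training script): tokenizing a list of strings yields a nested list per feature, while the collator's pad() expects each feature to carry a flat list of token ids.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Tokenizing a batch (a list of strings) produces a nested list per feature...
nested = tokenizer(["a low-pitched male voice"]).input_ids   # [[...]]
# ...while tokenizing a single string produces the flat list that pad() expects.
flat = tokenizer("a low-pitched male voice").input_ids       # [...]

print(type(nested[0]), type(flat[0]))  # <class 'list'> vs <class 'int'>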

Does it support French?

I recently stumbled upon your project and I'm excited about its potential.
I'm wondering if there are any plans to add French language support in the future.

attention_mask

Hi, I have an attention_mask mismatch problem in the cross-attention.

Can you please explain this line:
requires_attention_mask = "encoder_outputs" not in model_kwargs ?

Why does it come after this:
if "encoder_outputs" not in model_kwargs:
# encoder_outputs are created and added to model_kwargs
model_kwargs = self._prepare_text_encoder_kwargs_for_generation(
inputs_tensor,
model_kwargs,
model_input_name,
generation_config,
)

Is the attention mask needed for the cross-attention layer in the generation part?
This mismatch problem occurs only in the generator;
the train & eval are OK.

Thanks!

Exception in generated labels

Hello, I followed the training guide and trained a Parler-TTS model from scratch. First I initialized a new model using

python helpers/model_init_scripts/init_dummy_model.py ./parler-tts-untrained-dummy --text_model "google-t5/t5-small" --audio_model "parler-tts/dac_44khZ_8kbps"

After that, I referred to the parameters provided in the training guide and modified the tokenizer slightly to match the model. Specifically, this includes:

--description_tokenizer_name "google-t5/t5-small" \
--prompt_tokenizer_name "google-t5/t5-small" \

At the same time I modified the batch size to initiate training.
However, there seems to be an error in generating the labels, where two of the nine channels are completely populated with 1024 (i.e. pad_token/eos_token). Does this stem from the model not being adequately trained?
I've attached an image showing the results generated.
WechatIMG5880

Training on a NEW language

Suppose we have to train this TTS model on a language whose tokens are not in the Flan-T5 tokenizer. Can I simply change the name of the tokenizer in config.json, or do I have to make code changes as well? Note: the new tokenizer will not be a FLAN-T5 tokenizer.
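
Not an official answer, but a hedged sketch of the step that usually matters beyond renaming the tokenizer: if the replacement tokenizer has a different vocabulary, the text encoder's embedding table needs to be resized so the new token ids stay in range ("my-org/my-new-tokenizer" is a placeholder):

from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1")
new_tokenizer = AutoTokenizer.from_pretrained("my-org/my-new-tokenizer")  # placeholder repo

# Resize the text encoder's embeddings so every id produced by the new tokenizer is valid.
model.text_encoder.resize_token_embeddings(len(new_tokenizer))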

Model stumbling on its words

Running the following code:

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

description = "A male speaker with a low-pitched voice delivering his words at a slow pace in a small, confined space with a very clear audio and an animated tone."
prompt = "In the annals of history, the ink that drafted peace often dried under the shadow of future conflicts. Today, we dive deep into the bottom 10 worst peace treaties ever signed, the naive hopes and the grim repercussions they bore, unraveling a tapestry of unintended consequences that would haunt nations for generations. From agreements that sowed the seeds of resentments leading to catastrophic wars, to those that carved up continents disregarding the people who lived there, we explore how peace can sometimes lead to anything but."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write('output.wav', audio_arr, model.config.sampling_rate)

Outputs the .wav file posted at this link: http://sndup.net/vzyp

How can I get it to correctly output the prompt text? Is my prompt too large? Am I using the model incorrectly? Thank you!
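
Not a fix, but a workaround sketch that builds on the snippet above: the mini checkpoint tends to be more reliable on shorter inputs, so splitting the prompt into sentences, generating each chunk, and concatenating the audio may help (the naive split on ". " is only for illustration).

import numpy as np

# Reuses model, tokenizer, device, input_ids (the description ids) and sf from the snippet above.
sentences = [s.strip() for s in prompt.split(". ") if s.strip()]

chunks = []
for sentence in sentences:
    chunk_ids = tokenizer(sentence, return_tensors="pt").input_ids.to(device)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=chunk_ids)
    chunks.append(generation.cpu().numpy().squeeze())

sf.write("output_chunked.wav", np.concatenate(chunks), model.config.sampling_rate)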

OutOfMemoryError Encountered During Dataset Preprocessing

Hello, I have encountered an OutOfMemoryError when preprocessing my own dataset, during the "Encoding target audio with encodec" step on an A100. However, my dataset is only about 1/100 the size of yours. When I tried to reproduce your work, everything worked fine. Specifically, I loaded the dataset from a 'json' file containing the full paths to the audio and then converted the audio column to Audio. Do I need to save the dataset locally or push it to the Hub first? Could this be causing the issue? How can I preprocess a large dataset?

gathered_tensor tensor([0], device='cuda:0')                                                                                                                
Filter (num_proc=8): 100%|██████████| 3609/3609 [00:14<00:00, 255.72 examples/s]                                                                            
Filter (num_proc=8): 100%|██████████| 96/96 [00:13<00:00,  6.98 examples/s]                                                                                 
preprocess datasets (num_proc=8): 100%|██████████| 3589/3589 [00:13<00:00, 267.74 examples/s]                                                               
preprocess datasets (num_proc=8): 100%|██████████| 95/95 [00:14<00:00,  6.57 examples/s]                                                                    
06/18/2024 22:58:48 - INFO - __main__ - *** Encode target audio with encodec ***                                                                            
  0%|          | 0/180 [00:00<?, ?it/s]torch/nn/modules/conv.py:306: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
  return F.conv1d(input, weight, bias, self.stride,
 20%|██        | 36/180 [00:46<04:22,  1.82s/it]torch/nn/modules/conv.py:306: UserWarning: Plan failed with an OutOfMemoryError: CUDA out of memory. Tried to allocate 15.24 GiB. GPU  (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:924.)
  return F.conv1d(input, weight, bias, self.stride,
torch/nn/modules/conv.py:306: UserWarning: Plan failed with an OutOfMemoryError: CUDA out of memory. Tried to allocate 7.62 GiB. GPU  (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:924.)
  return F.conv1d(input, weight, bias, self.stride,
torch/nn/modules/conv.py:306: UserWarning: Plan failed with an OutOfMemoryError: CUDA out of memory. Tried to allocate 30.49 GiB. GPU  (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:924.)
  return F.conv1d(input, weight, bias, self.stride,
torch/nn/modules/conv.py:306: UserWarning: Plan failed with an OutOfMemoryError: CUDA out of memory. Tried to allocate 22.87 GiB. GPU  (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:924.)
  return F.conv1d(input, weight, bias, self.stride,
torch/nn/modules/conv.py:306: UserWarning: Plan failed with an OutOfMemoryError: CUDA out of memory. Tried to allocate 15.25 GiB. GPU  (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:924.)
  return F.conv1d(input, weight, bias, self.stride,
 21%|██        | 38/180 [00:57<03:34,  1.51s/it]                                                                                                            
Traceback (most recent call last):                                                                                                                          
  File "/parler-tts/./training/run_parler_tts_training_local.py", line 1039, in <module>                                              
    main()                                                                                                                                                  
  File "/parler-tts/./training/run_parler_tts_training_local.py", line 436, in main                                                   
    generate_labels = apply_audio_decoder(batch)                                                                                                            
  File "/parler-tts/./training/run_parler_tts_training_local.py", line 415, in apply_audio_decoder 
    labels = audio_decoder.encode(**batch, bandwidth=bandwidth)["audio_codes"]                                                                              
  File "/parler-tts/parler_tts/dac_wrapper/modeling_dac.py", line 87, in encode
    _, encoded_frame, _, _, _ = self.model.encode(frame, n_quantizers=n_quantizers)                                                                         
  File "dac/model/dac.py", line 243, in encode
    z = self.encoder(audio_data)                                              
  File "torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)                                   
  File "torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)                                      
  File "dac/model/dac.py", line 91, in forward
    return self.block(x)                                                      
  File "torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)                                   
  File "torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)                                      
  File "torch/nn/modules/container.py", line 217, in forward
    input = module(input)                                                     
  File "torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)                                   
  File "torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)                                      
  File "dac/model/dac.py", line 61, in forward
    return self.block(x)                                                      
  File "torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)                                   
  File "torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)                                      
  File "torch/nn/modules/container.py", line 217, in forward
    input = module(input)                                                     
  File "torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)                                   
  File "torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)      
  File "torch/nn/modules/container.py", line 217, in forward
    input = module(input)                                                     
  File "torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)                                   
  File "torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)                                    
  File "torch/nn/modules/conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)                                                                                                
  File "torch/nn/modules/conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,                                                                                                       
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.09 GiB. GPU   
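
For reference, a minimal sketch of the setup described above (the manifest path and the column name are assumptions): the datasets JSON loader plus cast_column keeps audio decoding lazy. Note, though, that the OOM here happens on the GPU during DAC encoding, so very long clips or a large encoding batch size are more likely culprits than how the dataset is stored.

from datasets import Audio, load_dataset

# "train_manifest.json" and the "audio" column name are placeholders.
dataset = load_dataset("json", data_files="train_manifest.json", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=44100))

print(dataset[0]["audio"]["array"].shape)  # decoded lazily, one example at a time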

Flash Attention Support

Is there any way to add Flash Attention 2 support for this model? If there is a way to do it, I would love to get involved and help out!

I've tried implementing it by looking at MusicGen's version, but the performance is not much different.
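
A hedged sketch of the usual transformers-side switch; whether it actually applies depends on the installed parler-tts/transformers versions registering FA2 for this architecture, and from_pretrained will raise an error if they don't:

import torch
from parler_tts import ParlerTTSForConditionalGeneration

# flash-attn must be installed, and the weights should be loaded in fp16/bf16.
model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler_tts_mini_v0.1",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda:0")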

Unable to get it to run on the GPU

Hey, I was trying to run the code in a virtual Python env, and the TTS doesn't seem to use the GPU on my system.

Do we need to have the CUDA toolkit installed for the GPU to be used?
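
You don't need the full CUDA toolkit, but you do need a CUDA-enabled PyTorch build. A quick sanity check (plain PyTorch, nothing parler-tts specific):

import torch

print(torch.__version__)          # a CUDA build typically looks like "2.2.2+cu121", a CPU-only build like "2.2.2+cpu"
print(torch.cuda.is_available())  # the example scripts silently fall back to CPU when this is False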

Provide an OpenAI compatible web-server that can be deployed using Docker

Hey, I've just created parler-tts-server which is an OpenAI compatible web-server that can be deployed using Docker wrapping parler-tts. I'd like to create a pull-request, modifying the "Usage" section with the instructions on how to run the model with Docker using parler-tts-server. Would that be a welcome change?

I would also be open to creating a pull-request adding the parler-tts-server into this repo!

LM Output Vocab Size and Extra Tokens

Thanks for the fantastic work on this project!

I'm reviewing your code and noticed something about the language model's output vocabulary size. It seems to be set to encodec_vocab_size + 64. However, my understanding is that the LM generates tokens only from the encodec vocabulary, along with a single end_of_sequence token. If that's correct, wouldn't the necessary vocabulary size be encodec_vocab_size + 1?

I'm curious about the purpose of the additional 63 tokens in the output vocabulary. Could you please clarify why this is the case?

Trouble pronouncing dates

I found that the model (here: the Jenny model, but I found the same issue with ParlerTTS mini) seems to have trouble pronouncing years and numbers. For example:

"The Crusaders marched through Eastern Europe, gathering support and supplies along the way, before reaching Constantinople in 1097."

TTS_stumbles_on_numbers_00001-audio.mp4
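
A possible preprocessing workaround rather than a model fix, assuming the third-party num2words package: spell out numbers before passing the prompt to the model (treating 4-digit numbers as years is a rough heuristic).

import re
from num2words import num2words  # pip install num2words

def normalize_numbers(text: str) -> str:
    def spell(match: re.Match) -> str:
        digits = match.group()
        # Rough heuristic: read 4-digit numbers as years, everything else as cardinals.
        return num2words(int(digits), to="year" if len(digits) == 4 else "cardinal")
    return re.sub(r"\d+", spell, text)

print(normalize_numbers("before reaching Constantinople in 1097."))
# -> "before reaching Constantinople in ten ninety-seven."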

Add to HF Pipeline

Hi,
Would be nice to be able to use this using the text-to-speech pipeline.
Thanks!

[Inference] Setting do_sample=False disrupts the generation

I tried to output the same voice for consistency, so I set do_sample=False. However, the output is basically noise. Here is my code:

prompt = "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent."
description = "A female speaker delivers her words quite expressively, in a very confined sounding environment with clear audio quality."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, max_length=2580, num_beams=1, num_beam_groups=1, do_sample=False)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parle_tts_marco.wav", audio_arr, model.config.sampling_rate)

Poor quality when batch inferencing

code as below:

prompt1 = "Hey, how are you doing today?"
prompt2 = "Hey, good."
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer([description, description], return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer([prompt1, prompt2], padding=True, truncation=True,  return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
for i in range(2):
    sf.write(f"{i}_parler_tts_out.wav", audio_arr[i].squeeze(), model.config.sampling_rate)

The results don't seem stable.

Descript Audio Codec selection

The currently selected model is the 44kHz model, which has 9 codebooks. However, according to the tests at Codec-SUPERB, the decoder of the 24kHz model with 32 codebooks performs better. Is it possible to replace the 9 codebooks with 32 codebooks? If so, will it be difficult to train a decoder-only model?

Is there a way to create consistent voices?

I want to make an app that would read long texts in chunks. For this I need to get the same voice for the same speaker prompt. Now I get similar but still not the same voices each generation. Is it possible to somehow fix the voice?
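
Not a guaranteed solution, but a hedged workaround: generation is stochastic by default, so reseeding the RNG before every generate() call should reproduce the same voice for the same description and prompt on the same hardware and library versions.

import torch
import soundfile as sf
from transformers import AutoTokenizer, set_seed
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

description = "A female speaker with a slightly low-pitched voice and very clear audio."
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)

for i, chunk in enumerate(["First chunk of the long text.", "Second chunk of the long text."]):
    set_seed(42)  # reseed before every chunk so the sampled voice stays the same
    prompt_input_ids = tokenizer(chunk, return_tensors="pt").input_ids.to(device)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    sf.write(f"chunk_{i}.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)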

Benchmarks of parler-tts, the emergence of TTS!

Hey @sanchit-gandhi, I like the repo. Excited to see this being worked on. Here's a benchmark of WhisperSpeech. I used your sample script on the same exact text snippet and it finished processing in 16.04 seconds. However, this repo runs in float32 while I think WhisperSpeech is run in float16. Can you provide me with the modification to run in float16, or even bfloat16? I'm going to do a comparison of this, Bark, and WhisperSpeech:

image

I want to add that this says nothing about the quality, only speed. I'll evaluate quality next after I ensure comparable testing procedures regarding compute time. Here's the script I used:

SCRIPT HERE
import time
import sounddevice as sd
import torch
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

# Setup device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

# Prepare input
prompt = "This script processes a body of text one sentence at a time and plays them consecutively. This enables the audio playback to begin sooner instead of waiting for the entire body of text to be processed. The script uses the threading and queue modules that are part of the standard Python library. It also uses the sound device library, which is fairly reliable across different platforms. I hope you enjoy, and feel free to modify or distribute at your pleasure."
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

# Start timer
start_time = time.time()

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

# End timer
end_time = time.time()
processing_time = end_time - start_time

# Print processing time in green
print(f"\033[92mProcessing time: {processing_time:.2f} seconds\033[0m")

sampling_rate = model.config.sampling_rate
sd.play(audio_arr, samplerate=sampling_rate)
sd.wait()

Lastly, let me know what other "speedups" I can use such as bettertransformer, which I think is part of torch now unless I'm mistaken. I can't test FA2 unless you help me install it. I've tried.
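
For the half-precision question, a minimal hedged sketch (not an official benchmarking setup): load the weights in float16, keep the inputs on the same device, and cast the output back to float32 before writing it to disk; for bfloat16 just swap the dtype.

import torch
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda:0"
model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler_tts_mini_v0.1", torch_dtype=torch.float16  # or torch.bfloat16
).to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

input_ids = tokenizer("A female speaker with a clear, slightly low-pitched voice.", return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer("Hello there, this is a half-precision test.", return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.to(torch.float32).cpu().numpy().squeeze()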

Running gets output like this...failure

(base) gwen@GwenSeidr:/2/parler-tts$ virtualenv parler_tts_env
created virtual environment CPython3.10.12.final.0-64 in 328ms
creator CPython3Posix(dest=/home/gwen/2/parler-tts/parler_tts_env, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/gwen/.local/share/virtualenv)
added seed packages: GitPython==3.1.43, Jinja2==3.1.3, Markdown==3.6, MarkupSafe==2.1.5, PyYAML==6.0.1, absl_py==2.1.0, accelerate==0.29.2, aiohttp==3.9.4, aiosignal==1.3.1, appdirs==1.4.4, argbind==0.3.7, asttokens==2.4.1, async_timeout==4.0.3, attrs==23.2.0, audioread==3.0.1, certifi==2024.2.2, cffi==1.16.0, charset_normalizer==3.3.2, click==8.1.7, contourpy==1.2.1, cycler==0.12.1, datasets==2.18.0, decorator==5.1.1, descript_audio_codec==1.0.0, descript_audiotools==0.7.2, dill==0.3.8, docker_pycreds==0.4.0, docstring_parser==0.16, einops==0.7.0, evaluate==0.4.1, exceptiongroup==1.2.0, executing==2.0.1, ffmpy==0.3.2, filelock==3.13.4, fire==0.6.0, flatten_dict==0.4.2, fonttools==4.51.0, frozenlist==1.4.1, fsspec==2024.2.0, future==1.0.0, gitdb==4.0.11, grpcio==1.62.1, huggingface_hub==0.22.2, idna==3.7, importlib_resources==6.4.0, ipython==8.23.0, jedi==0.19.1, jiwer==3.0.3, joblib==1.4.0, julius==0.2.7, kiwisolver==1.4.5, lazy_loader==0.4, librosa==0.10.1, llvmlite==0.42.0, markdown2==2.4.13, markdown_it_py==3.0.0, matplotlib==3.8.4, matplotlib_inline==0.1.6, mdurl==0.1.2, mpmath==1.3.0, msgpack==1.0.8, multidict==6.0.5, multiprocess==0.70.16, networkx==3.3, numba==0.59.1, numpy==1.26.4, nvidia_cublas_cu12==12.1.3.1, nvidia_cuda_cupti_cu12==12.1.105, nvidia_cuda_nvrtc_cu12==12.1.105, nvidia_cuda_runtime_cu12==12.1.105, nvidia_cudnn_cu12==8.9.2.26, nvidia_cufft_cu12==11.0.2.54, nvidia_curand_cu12==10.3.2.106, nvidia_cusolver_cu12==11.4.5.107, nvidia_cusparse_cu12==12.1.0.106, nvidia_nccl_cu12==2.19.3, nvidia_nvjitlink_cu12==12.4.127, nvidia_nvtx_cu12==12.1.105, packaging==24.0, pandas==2.2.2, parler_tts==0.1, parso==0.8.4, pexpect==4.9.0, pillow==10.3.0, pip==24.0, platformdirs==4.2.0, pooch==1.8.1, prompt_toolkit==3.0.43, protobuf==3.19.6, psutil==5.9.8, ptyprocess==0.7.0, pure_eval==0.2.2, pyarrow==15.0.2, pyarrow_hotfix==0.6, pycparser==2.22, pygments==2.17.2, pyloudnorm==0.1.1, pyparsing==3.1.2, pystoi==0.4.1, python_dateutil==2.9.0.post0, pytz==2024.1, randomname==0.2.1, rapidfuzz==3.8.1, regex==2023.12.25, requests==2.31.0, responses==0.18.0, rich==13.7.1, safetensors==0.4.2, scikit_learn==1.4.2, scipy==1.13.0, sentencepiece==0.2.0, sentry_sdk==1.45.0, setproctitle==1.3.3, setuptools==69.2.0, six==1.16.0, smmap==5.0.1, soundfile==0.12.1, soxr==0.3.7, stack_data==0.6.3, sympy==1.12, tensorboard==2.16.2, tensorboard_data_server==0.7.2, termcolor==2.4.0, threadpoolctl==3.4.0, tokenizers==0.15.2, torch==2.2.2, torch_stoi==0.2.1, torchaudio==2.2.2, tqdm==4.66.2, traitlets==5.14.2, transformers==4.39.3, triton==2.2.0, typing_extensions==4.11.0, tzdata==2024.1, urllib3==2.2.1, wandb==0.16.6, wcwidth==0.2.13, werkzeug==3.0.2, wheel==0.43.0, xxhash==3.4.1, yarl==1.9.4
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
(base) gwen@GwenSeidr:/2/parler-tts$ source parler_tts_env/bin/activate
(parler_tts_env) (base) gwen@GwenSeidr:/2/parler-tts$ source parler_tts_env/bin/activate
(parler_tts_env) (base) gwen@GwenSeidr:/2/parler-tts$ python helpers/model_init_scripts/init_model_600M.py ./parler-tts-untrained-600M --text_model "google/flan-t5-base" --audio_model "parler-tts/dac_44khZ_8kbps"
num_codebooks 9
/home/gwen/2/parler-tts/parler_tts_env/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Removed shared tensor {'text_encoder.encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading

run_parler_tts_training.py gives datasets.table.CastError error and failure

I've run through the steps to train a single voice, and it goes well until it comes time for the actual "Fine-tuning Parler-TTS" step, where I'm hitting a wall. It seems the previous Dataset Annotation instructions don't create all of the expected values?

File "/Users/durin/AI/Projects/.parler-env/lib/python3.12/site-packages/datasets/table.py", line 2249, in cast_table_to_schema raise CastError( datasets.table.CastError: Couldn't cast text: string utterance_pitch_mean: float utterance_pitch_std: float snr: double c50: double speaking_rate: string phonemes: string noise: string reverberation: string speech_monotony: string -- schema metadata -- huggingface: '{"info": {"features": {"text": {"dtype": "string", "_type":' + 502 to {'text': Value(dtype='string', id=None), 'utterance_pitch_mean': Value(dtype='float32', id=None), 'utterance_pitch_std': Value(dtype='float32', id=None), 'snr': Value(dtype='float64', id=None), 'c50': Value(dtype='float64', id=None), 'speaking_rate': Value(dtype='string', id=None), 'phonemes': Value(dtype='string', id=None), 'noise': Value(dtype='string', id=None), 'reverberation': Value(dtype='string', id=None), 'speech_monotony': Value(dtype='string', id=None), 'audio': Audio(sampling_rate=44100, mono=True, decode=True, id=None)} because column names don't match

Just to confirm: when I run the previous step to view a sample from the dataset, here are the full contents:
{'text': " Tonight at 11 on Utah's Talk Radio.", 'utterance_pitch_mean': 114.15099334716797, 'utterance_pitch_std': 30.472904205322266, 'snr': 60.48295974731445, 'c50': 59.44480895996094, 'speaking_rate': 'quite slowly', 'phonemes': " tʌnaɪt æt ɑn jutɔ'ɛs tɔk ɹeɪdioʊ . .", 'noise': 'slightly clear', 'reverberation': 'very confined sounding', 'speech_monotony': 'quite monotone', 'text_description': "'Very clear recording, but the speech is very monotone and slightly muffled by the recording.'"}

Not sure what I might be doing wrong, and I won't pretend to be an expert at this, so any guidance would be appreciated.

Won't work

First of all, congrats on your accomplishments!

I must be doing something wrong, because I can't get it to work.
I want to install it in my textgenwebui environment, but I get this error:

C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\torch\nn\utils\weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Using the model-agnostic default max_length (=2580) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation.
Calling sample directly is deprecated and will be removed in v4.41. Use generate or a custom generation loop instead.
--- Logging error ---
Traceback (most recent call last):
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\logging\__init__.py", line 1110, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\logging\__init__.py", line 953, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\logging\__init__.py", line 687, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\logging\__init__.py", line 377, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\text-generation-webui-snapshot-2024-04-21\snippet.py", line 17, in <module>
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\parler_tts\modeling_parler_tts.py", line 2608, in generate
    outputs = self.sample(
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 2584, in sample
    return self._sample(*args, **kwargs)
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 2730, in _sample
    logger.warning_once(
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\transformers\utils\logging.py", line 329, in warning_once
    self.warning(*args, **kwargs)
Message: 'eos_token_id is deprecated in this function and will be removed in v4.41, use stopping_criteria=StoppingCriteriaList([EosTokenCriteria(eos_token_id=eos_token_id)]) instead. Otherwise make sure to set model.generation_config.eos_token_id'
Arguments: (<class 'FutureWarning'>,)

It is super vague, and I don't know where to look next.
My current versions are Python 3.11, torch 2.2.1+cu121, and transformers 4.40.0.

Can anyone point me in the right direction?
Thanks for your time!

Please use .mp3 in the soundfile output

Please put the .mp3 extension in the Usage soundfile output example, i.e.

sf.write("parler-tts.mp3",...

Soundfile docs say MP3 has been supported since 2022-06, and it doesn't seem to be responsible for the ** when the prompt is longer than a sentence or two.
Regards
G.

Zero-Shot Voice Cloning

Hi,
I know this library is primarily for text -> voice but do you know if it would be possible to modify it to accept a speaker embedding and perform zero-shot voice cloning?
Thanks!

Fine-Tuning colab error

Hello, I'm trying to run the exact fine-tuning script in the colab notebook. But there's an error on the cell below with the following output.

!python main.py "ylacombe/jenny-tts-6h" \
  --configuration "default" \
  --text_column_name "transcription" \
  --audio_column_name "audio" \
  --cpu_num_workers 2 \
  --num_workers_per_gpu_for_pitch 2 \
  --rename_column \
  --repo_id "jenny-tts-tags-6h"
main.py 52 <module>
snr_dataset = dataset.map(

dataset_dict.py 869 map
{

dataset_dict.py 870 <dictcomp>
k: dataset.map(

arrow_dataset.py 602 wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)

arrow_dataset.py 567 wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)

arrow_dataset.py 3156 map
for rank, done, content in Dataset._map_single(**dataset_kwargs):

arrow_dataset.py 3547 _map_single
batch = apply_function_on_filtered_inputs(

arrow_dataset.py 3416 apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)

snr_and_reverb.py 22 snr_apply
pipeline = RegressiveActivityDetectionPipeline(segmentation=model)

pipeline.py 72 __init__
self._frames = self._segmentation.model.example_output.frames

module.py 1688 __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

AttributeError:
'CustomPyanNetModel' object has no attribute 'example_output'

GPU parallel

Hi,
is there an option to train the model on multiple GPUs?

If not, is there an easy way to adapt it?

Thanks!

[show and tell] apple mps support

With newer PyTorch (2.4 nightly) we get bfloat16 support in MPS.

I tested this:

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "mps:0"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device=device, dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "welcome to huggingface"
description = "An old man."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device=device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device=device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.to(torch.float32).cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

Save checkpoints as usable models

Hey everyone,

I am trying to fine-tune a model. I ran into overfitting after some training. Now I want to save a previous checkpoint as my model. As far as I can see, you are using safetensors models with the "ParlerTTSForConditionalGeneration.from_pretrained()" method.

I cannot find an easy way to load and save a checkpoint without starting a new training run. Do you have any suggestions?

Thank you :)
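
Not certain this matches every checkpoint layout, but a hedged sketch assuming the intermediate checkpoint directory contains weights that from_pretrained can read ("./output/checkpoint-5000" is a placeholder path):

from parler_tts import ParlerTTSForConditionalGeneration

# Load the intermediate checkpoint and re-save it as a standalone, reusable model folder.
model = ParlerTTSForConditionalGeneration.from_pretrained("./output/checkpoint-5000")
model.save_pretrained("./my-finetuned-parler-tts")
# model.push_to_hub("username/my-finetuned-parler-tts")  # optional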

sampling rate issue

Great work!

When running the DAC token extraction stage of the training script with the default hyperparams, I got this warning:

It is strongly recommended to pass the sampling_rate argument to this function. Failing to do so can result in silent errors that might be hard to debug.

I checked the feature_extractor.sampling_rate which got passed to load_multiple_datasets and it's indeed 44100Hz.

Just want to make sure this is expected.

Thanks!

Custom pronunciation for words - any thoughts / recommendations about how best to handle them?

Hello! This is a really interesting looking project.

Currently there doesn't seem to be any way that users can help the model correctly pronounce custom words - for instance, JPEG is something that speakers just need to know is broken down as "Jay-Peg" rather than Jay-Pea-Ee-Gee.

I appreciate this project is at an early stage but for practical uses, especially with brands and product names often having quirky ways of saying words or inventing completely new words, it's essential to be able to handle their correct pronunciation on some sort of override basis. It's not just brands - plenty of people's names need custom handling and quite a few novel computer words are non-obvious too.

Examples that cause problems in the current models: Cillian, Joaquin, Deirdre, Versace, Tag Heuer, Givenchy, gigabytes, RAM, MPEG etc.

Are there any suggestions on how best to tackle this?

I saw there was #33 which uses a normaliser specifically for numbers. Is there something similar for custom words? I suppose perhaps one could drop in a list of custom words and some sort of mapping to the desired pronunciation, applying that as a stage similar to how it handles abbreviations.

In espeak backed tools, it's sometimes possible to replace words with custom IPA that replaces the default IPA generated but I believe this model doesn't use IPA for controlling pronunciation.

Given the frequently varying pronunciations, I doubt that simply finetuning to include the words would be a viable approach.

Anyway, would be great to hear what others have to recommend.

Incidentally certain mainstream terms also get completely garbled, it seems impossible to get Instagram, Linux or Wikipedia to be spoken properly, but that's more a training data issue and those are mainstream enough that you wouldn't need to cover them via custom overrides.
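
In the spirit of the mapping idea above, a hedged sketch of a prompt-side override layer (the respellings are purely illustrative and would need tuning by ear):

import re

# Hypothetical respelling table, applied to the prompt before tokenization.
PRONUNCIATIONS = {
    "JPEG": "jay peg",
    "Joaquin": "wah keen",
    "Versace": "ver sah chay",
}

def apply_pronunciations(text: str) -> str:
    for word, spoken in PRONUNCIATIONS.items():
        text = re.sub(rf"\b{re.escape(word)}\b", spoken, text, flags=re.IGNORECASE)
    return text

print(apply_pronunciations("Send the JPEG to Joaquin."))
# -> "Send the jay peg to wah keen."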

Need the ability to save/re-use a generated voice

We use TTS in an eLearning environment where we generate hundreds of videos per year. All of these videos must use the same exact voice for consistency.

To use Parler-TTS I'd need to be able to generate a voice (based upon a description), save it, then use it across multiple TTS sessions. We currently use Google's TTS api which allows me to select from a list of voices so that all of my TTS audio sounds exactly like the same speaker.

Streaming support?

Is there any streaming support for this model? If there is a way to do it, I would love to get involved and help out!

Error: Assertion `srcIndex < srcSelectDimSize` failed

I initialized models from scratch, because flan-t5 doesn't support my language:
!python helpers/model_init_scripts/init_dummy_model.py ./parler-tts-untrained-dummy_1 --text_model "google-t5/t5-small" --audio_model "parler-tts/dac_44khZ_8kbps"
!python helpers/model_init_scripts/init_dummy_model.py ./parler-tts-untrained-dummy_2 --text_model "google/flan-t5-base" --audio_model "parler-tts/dac_44khZ_8kbps"

When training, I get the same error:

Step... (100 / 8280 | Loss: 6.848744869232178, Learning Rate: 1.6000000000000003e-05)
Step... (200 / 8280 | Loss: 6.750528812408447, Learning Rate: 3.2000000000000005e-05)
Step... (300 / 8280 | Loss: 6.650070667266846, Learning Rate: 4.8e-05)
Step... (400 / 8280 | Loss: 6.698696613311768, Learning Rate: 6.400000000000001e-05)
Step... (500 / 8280 | Loss: 6.726866245269775, Learning Rate: 8e-05)
Train steps ... : 6%|█▏ | 500/8280 [05:34<1:28:30, 1.47it/s]

Evaluating - Inference ...: 0%| | 0/1 [00:00<?, ?it/s]

Evaluating - Inference ...: 100%|█████████████████| 1/1 [00:00<00:00, 3.24it/s]

Evaluating - Generation ...: 0%| | 0/1 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [2,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [2,0,0], thread: [97,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [2,0,0], thread: [98,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [2,0,0], thread: [99,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [2,0,0], thread: [100,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [2,0,0], thread: [101,0,0] Assertion srcIndex < srcSelectDimSize failed.

How could I make this work in Spanish?

It would be very nice if it also worked in Spanish. I know that it should be possible, but I don't know how to do it with my technical knowledge...

[Question] "I got strongly recommended to pass the `sampling_rate` argument to this function... " Is this expected?

Hi,

I'm trying to retrain the model, and I got this message during the step that encodes the audio samples:

"strongly recommended to pass the sampling_rate argument to this function." Is this expected?

Then, after about 8%, it stopped with the error Signal 11 (SIGSEGV) received.

Is there any clue?

really appreciate your help 🙏


[rank1]: RuntimeError: DataLoader worker (pid 66405) is killed by signal: Segmentation fault.
W0507 11:19:08.585000 134185846527040 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 65464 closing signal SIGTERM
E0507 11:19:09.022000 134185846527040 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 1 (pid: 65465) of binary: /home/ys/anaconda3/envs/parler/bin/python
Traceback (most recent call last):

...
...

=======================================================
training/run_parler_tts_training.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-07_11:19:08
  host      : trainer
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 65465)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 65465
=======================================================

error regarding some tokenizer issue

When I run the sample script I keep getting this error message among others...not sure how dire it is or whether it even impacts performance...

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers

Feature improvements

Hello, I tried the Parler-TTS Mini model and it exceeded my expectations with very good results.

However, I have some questions and possible suggestions for improvement:

  1. Will there be a multi-lingual version available, such as support for Mandarin?
  2. Currently, the accuracy of numbers and punctuation marks is not very good, and there are instances where words are dropped in sentences. Will these issues be addressed in future versions?

TypeError: iteration over a 0-d tensor

I tried it on Colab, I tried it on my hardware, and in both cases this error occurred during training.

gathered_tensor tensor([0], device='cuda:0')
06/15/2024 17:04:25 - INFO - __main__ - *** Encode target audio with encodec ***
 96%|██████████████████████████████████████████████████████████▌  | 24/25 [00:54<00:02,  2.27s/it]
Traceback (most recent call last):
  File "/home/lab/parler-tts/./training/run_parler_tts_training.py", line 1032, in <module>
    main()
  File "/home/lab/parler-tts/./training/run_parler_tts_training.py", line 437, in main
    lab = [l[:, : int(ratio * length)] for (l, ratio, length) in zip(lab, rat, lens)]
  File "/home/lab/parler-tts/venv/lib/python3.10/site-packages/torch/_tensor.py", line 1047, in __iter__
    raise TypeError("iteration over a 0-d tensor")
TypeError: iteration over a 0-d tensor
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/lab/parler-tts/wandb/offline-run-20240615_170319-y3l2zohb
wandb: Find logs at: ./wandb/offline-run-20240615_170319-y3l2zohb/logs
Traceback (most recent call last):
  File "/home/lab/parler-tts/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/lab/parler-tts/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/lab/parler-tts/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/home/lab/parler-tts/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
